Overview

  • Adjacency matrix, basic manipulation
  • Network non-parametric approaches
  • Ego-nets
  • Network regression

Adjacency matrix

Basic definitions

For a set of nodes \(V=\{1,\ldots,n\}\) we define ties as un-ordered pairs of distinct nodes \(\{i,j\}\), with \(i,j \in V\) and \(i\neq j\). We denote the set of all un-ordered pairs \({{V}\choose{2}}\), meaning that ties or edges are two-element subsets of \(V\), and the set of ties is \(E \subseteq {{V}\choose{2}}\).
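The set \({{V}\choose{2}}\) can be enumerated directly in base R; a quick sketch for \(n=5\) nodes:

```r
# Enumerate all un-ordered pairs {i, j} for n = 5 nodes
n <- 5
pairs <- t(combn(n, 2))   # each row is one two-element subset of V
pairs
choose(n, 2)              # there are 10 such pairs in total
```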

Adjacency matrix

For all \(\{i,j\} \in {{V}\choose{2}}\), we define the tie-variables \(x_{ij}\) as being indicators \[ \begin{equation*} x_{ij}= \left\{ \begin{array}{lr} 1,&\text{if there is an edge between } i \text{ and } j\\ 0,&\text{else} \end{array} \right. \end{equation*} \]

We collect these in an \(n \times n\) adjacency matrix \[ X = \begin{bmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1n}\\ x_{21} & x_{22} & x_{23} & \cdots & x_{2n}\\ x_{31} & x_{32} & x_{33} & \cdots & x_{3n}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & x_{n3} & \cdots & x_{nn}\\ \end{bmatrix} \]

Objectives

This is an introduction to social networks using built-in functions in R and the packages sna and network. We will learn the

  • basic features of the adjacency matrix represented as a matrix object,
  • how to calculate the degrees of the nodes, and
  • how to calculate some fundamental descriptives of the network.

We will then translate the network from a matrix object to a network object in order to plot the sociogram.

For full details of the packages, see https://cran.r-project.org/web/packages/sna/sna.pdf and https://cran.r-project.org/web/packages/network/network.pdf. For accessible general R help see https://www.statmethods.net/ and for any kind of errors use https://google.com.

This introduction is deliberately written in inelegant R, using the most basic functions possible. Many packages offer sleeker and more user-friendly network routines, such as ‘igraph’. In particular, I would like to recommend the packages of David Schoch http://mr.schochastics.net/ for accessible and elegant network analysis in R. In general, basic plots in R (described in https://www.statmethods.net/graphs/index.html) are functional, but more advanced and better-looking plots can be achieved through ‘ggplot’.

For basic concepts in network analysis see G. Robins (2015) and Borgatti, Everett, and Johnson (2018). There is also a handy online book http://faculty.ucr.edu/~hanneman/nettext/ (Hanneman and Riddle 2005).

Build your own network

To use sna (Butts 2016) and network (Butts 2015) for the first time, install the packages (uncomment the install commands)

# install.packages("sna")
# install.packages("network")

Once packages are installed, load them

library("sna")
library("network")

The Matrix

Create an empty adjacency matrix for n = 5 nodes

n <- 5
ADJ <- matrix(0,n,n) # create a matrix with n rows and n columns and all values 0

Add ties \(1 \rightarrow 2\), \(1 \rightarrow 3\), \(2 \rightarrow 3\), \(3 \rightarrow 4\), and \(4 \rightarrow 5\)

ADJ[1,2] <- 1
ADJ[1,3] <- 1
ADJ[2,3] <- 1
ADJ[3,4] <- 1
ADJ[4,5] <- 1
ADJ

To make the network undirected, add the ties \(2 \rightarrow 1\), \(3 \rightarrow 1\), \(3 \rightarrow 2\), \(4 \rightarrow 3\), and \(5 \rightarrow 4\)

ADJ[2,1] <- 1
ADJ[3,1] <- 1
ADJ[3,2] <- 1
ADJ[4,3] <- 1
ADJ[5,4] <- 1
ADJ
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    1    1    0    0
## [2,]    1    0    1    0    0
## [3,]    1    1    0    1    0
## [4,]    0    0    1    0    1
## [5,]    0    0    0    1    0

Cells in the adjacency matrix and tie-variables

In general the cell ADJ[i,j] corresponds to the tie-variable \(X_{i,j}\). Here \(x_{1,2}=1\)

ADJ[1,2]

but, for example, \(x_{1,4}=0\)

ADJ[1,4]

The ties of node \(i=1\) are given by the \(i\)’th row

ADJ[1,]

Density

The adjacency matrix has

dim(ADJ)

rows and columns. This means that there are \(n \times n\) cells in the adjacency matrix.

dim(ADJ)[1]*dim(ADJ)[2]
length(ADJ)

The \(n\) diagonal elements \(x_{11},x_{22},\ldots,x_{nn}\) are zero by definition, which means that there are \(n \times n - n = n(n-1)\) variables that can be non-zero, here

dim(ADJ)[1]*dim(ADJ)[2] - n

Density: How many variables are equal to 1 out of the total possible?

The total number of ones \[L = \sum_{i,j,i\neq j}x_{ij}=x_{12}+\cdots+x_{1n}+x_{21}+\cdots+x_{(n-1)n}\] is simply a count of the number of non-zero entries

sum(ADJ)

The density thus is

sum(ADJ)/(n*(n-1))

and 50% of possible ties are present in the network.

Degree

How many ties does a node have?

The degree \(d_i\) of a node \(i\) is defined as the sum \(d_i=\sum_{j}x_{i,j}=x_{i,1}+x_{i,2}+\cdots + x_{i,n}\) (the diagonal term \(x_{i,i}\) is zero by definition). The degree of node \(i=1\) is thus

sum(ADJ[1,])

and the degree of node \(i=2\) is

sum(ADJ[2,])

Degree distribution

Calculate the column sums of the adjacency matrix to get the vector of degrees (note the capital S)

colSums(ADJ)

The degree distribution is the table of frequencies of degrees

table( colSums(ADJ) )

You can chart the degree distribution with a bar chart

plot( table( colSums(ADJ) ))

You can use standard R-routines to explore the adjacency matrix

For example, finding which node(s) have, say, degree 3

which(colSums(ADJ)==3)

Or subsetting the adjacency matrix to look only at nodes with degree 2 or greater

use <- which(colSums(ADJ)>=2) # the indices of the nodes with degree at least 2
ADJ[use,use]

Fun Fact: Linear algebra

Most network metrics can be calculated using linear algebra. For example, if \(X_{i,j}\) in \(X\) tells you whether \(i\) and \(j\) are directly connected, element \((XX)_{i,j}\) of the matrix product \(XX\) tells you how many paths \(i \rightarrow k \rightarrow j\) there are

ADJ %*% ADJ

Element \((XXX)_{i,j}\) of the matrix product \(XXX\) tells you how many paths \(i \rightarrow k \rightarrow h \rightarrow j\) there are

ADJ %*% ADJ %*% ADJ

Network object

Plotting the matrix object ADJ is not meaningful because R does not know that this is an adjacency matrix. To interpret ADJ as a network, translate the adjacency matrix to a network object

net <- as.network(ADJ, directed = FALSE)

NB: in the network package you use directed=FALSE in lieu of setting mode equal to graph.

The new object net is an object of type

class(net)

While printing ADJ to screen just gives you the matrix, printing net gives you a summary of the network

net

Plot sociogram

When plotting a network object, R knows that you want to plot the sociogram

plot( net )

For various plotting options see ?plot.network. For example, set node-size to degree, include labels, and set different colours

plot( net , # the network object
      vertex.cex = degree(net) , # how should nodes (vertices) be scaled
      displaylabels =  TRUE, # display the labels of vertices
      vertex.col = c('red','blue','grey','green','yellow'))

Note that degree(net) is a function from the sna package for calculating the degrees of the nodes. The next step will explore more of these functions.

Network position

Continue with the network that we constructed previously in Minimal Intro.

Recall, basic properties of this network are

net
##  Network attributes:
##   vertices = 5 
##   directed = FALSE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   bipartite = FALSE 
##   total edges= 5 
##     missing edges= 0 
##     non-missing edges= 5 
## 
##  Vertex attribute names: 
##     vertex.names 
## 
## No edge attributes

Degree centrality

What nodes have more ties?

degree( net )

This is a vector with the degree centrality for each node in the network. Degree centrality is also called Freeman degree centrality (Freeman 1978).

Betweenness centrality

Betweenness is related to connectivity and flow in a network (Freeman 1978; Borgatti and Everett 2006). This measure requires that you know the structure of the entire network.

betweenness( net , gmode ='graph' )

Node 3 is on the shortest path between 4 pairs of nodes:

  • 1 – 3 – 4
  • 1 – 3 – 4 – 5
  • 2 – 3 – 4
  • 2 – 3 – 4 – 5

These nodes will only be able to ‘communicate’ via 3

Brokers

A cutpoint is a node that, when removed, results in the network having more components. In simpler terms, a cutpoint is a node that connects otherwise disconnected nodes

cutpoints( net , mode = "graph")
## [1] 3 4

The extent to which nodes broker between other nodes has been the focus of a large part of Ron Burt’s career. He proposes some elaborations for measuring brokerage in Burt (2009), called constraint and effective size. These are local measures that, as opposed to cutpoints and betweenness, do not require that you know the structure of the entire network.
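For binary undirected networks, effective size has a simple form: ego's degree minus the average number of ties ego's alters have to the other alters. A base-R sketch (effective.size is an illustrative helper name, not a function from sna):

```r
# Sketch of Burt's effective size for binary, undirected networks:
# ego's degree minus the average number of ties each alter has to
# ego's other alters. Illustrative helper, not from 'sna'.
effective.size <- function(adj, i){
  alters <- which(adj[i, ] == 1)                            # ego i's neighbours
  k <- length(alters)
  if (k == 0) return(0)
  t.alters <- sum(adj[alters, alters, drop = FALSE]) / 2    # ties among alters
  k - 2 * t.alters / k
}

# The 5-node example network from above
ADJ <- matrix(0, 5, 5)
edges <- rbind(c(1,2), c(1,3), c(2,3), c(3,4), c(4,5))
ADJ[edges] <- 1; ADJ[edges[, 2:1]] <- 1

effective.size(ADJ, 3)   # node 3: degree 3, one tie among alters, so 3 - 2/3
```

Node 1 has two alters (2 and 3) who are themselves tied, giving effective size \(2 - 2/2 = 1\).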


Triad census

The triad census tabulates the different types of triples (J. A. Davis 1967; Holland and Leinhardt 1972): the number of triples with 0 edges, 1 edge, 2 edges, and 3 edges
triad.census( net , mode = "graph")
##      0 1 2 3
## [1,] 0 6 3 1

There are 6 triples with one tie. For example, the subgraph consisting of nodes 1, 4, and 5.

plot(as.network( ADJ[c(1,4,5),c(1,4,5)] , directed = FALSE),
     vertex.cex = degree(net)[c(1,4,5)],
     vertex.col = c('red','blue','grey','green','yellow')[c(1,4,5)])

Open triad: There are 3 triples with two ties. For example, the subgraph consisting of nodes 3, 4, and 5.

plot(as.network( ADJ[c(3,4,5),c(3,4,5)] , directed = FALSE),
     vertex.cex = degree(net)[c(3,4,5)],
     vertex.col = c('red','blue','grey','green','yellow')[c(3,4,5)])

Closed triad: There is 1 triple with three ties. For example, the subgraph consisting of nodes 1, 2, and 3.

plot(as.network( ADJ[c(1,2,3),c(1,2,3)] , directed = FALSE ),
     vertex.cex = degree(net)[c(1,2,3)],
     vertex.col = c('red','blue','grey','green','yellow')[c(1,2,3)] )

Cliques

A \(k\)-clique is a maximal complete subgraph. This means that a \(k\)-clique is a set of \(k\) nodes that are all connected to each other and that is not contained in any larger such set. In net, nodes 4 and 5 constitute a 2-clique and nodes 1, 2, and 3 constitute a 3-clique. Like the triad census, we can calculate a clique census for a graph

cc <- clique.census( net , mode = "graph")

The object cc is of class

class(cc)

The list has the following objects

names(cc)

The object clique.count tabulates the number of cliques of each size and the nodes’ memberships in them

cc$clique.count

A list of lists of cliques and their members is provided in

cc$cliques

Note that subgraphs of cliques are not listed. For example, the 2-clique with nodes 1 and 2 is not listed as both are part of the larger 3-clique.

Position directed

This is an introduction to analysing directed networks in R using the packages sna and network. We will revisit the adjacency matrix and the notion of tie-variables for directed networks.

We will learn the basic

  • differences between in- and out-degree
  • definitions of the basic subgraphs
    • dyads
    • triads

We will use a dataset of 73 high school pupils collected by James Coleman (1964) that comes with the package sna.

For full details of the packages, see https://cran.r-project.org/web/packages/sna/sna.pdf and https://cran.r-project.org/web/packages/network/network.pdf.

The background and description of the dataset is provided in the help file

?coleman

Now load the network

data(coleman)

The adjacency matrix

What do I have in my workspace and what did I load?

The function data() loaded something. Use the general purpose command ls() to list what is in your workspace

ls()

This is not one adjacency matrix but

class(coleman)

As described in the help file, this is an array containing 2 adjacency matrices, each of dimension \(73 \times 73\)

dim(coleman)

The individual slices are matrices

class(coleman[1,,])
class(coleman[2,,])

You can print coleman[1,,] and coleman[2,,] to screen to view the adjacency matrices (but that will just be a lot of ones and zeros, in particular \(2 \times 73 \times 73\) ones and zeros). You can visualise the adjacency matrices in these matrix plots

par( mfrow = c(1,2))
plot.sociomatrix( coleman[1,,] , drawlab=FALSE , drawlines = FALSE, xlab = 'FALL')
plot.sociomatrix( coleman[2,,] , drawlab=FALSE , drawlines = FALSE, xlab = 'SPRING')

In the adjacency matrix, rows record the ties sent, and columns record the ties received. Student 1 has the out-tie variables \(x_{1,2},x_{1,3},\ldots,x_{1,n}\)

 coleman[1,1,]

For example, that coleman[1,1,14] is 1 means that student 1 has a tie to student 14, and that coleman[1,1,2] is 0 means that student 1 does not have a tie to student 2.

Outdegree

Taking the sum of the out-ties

 sum(coleman[1,1,])

gives us student 1’s outdegree, \(d_i^o= \sum_{j,j\neq i}x_{i,j}\).

To get the outdegree of all pupils, sum across columns for all rows

 rowSums(coleman[1,,])

And, like for un-directed networks, we can chart the outdegree distribution

 plot( table (rowSums(coleman[1,,]) ))

Indegree

The ties a student has received are in the column of that particular student. Student 1 has received the tie variables \(x_{2,1},x_{3,1},\ldots,x_{n,1}\)

 coleman[1,,1]

For example, that coleman[1,14,1] is 0 means that student 1 has not received a tie from student 14.

To get the indegree of all pupils, sum across rows for all columns

 colSums(coleman[1,,])

And, like for un-directed networks, we can chart the indegree distribution

 plot( table (colSums(coleman[1,,]) ))

The number of ties sent and received are not the same!

For example, student 1 has nominated 5 other people but student 1 has not been nominated back at all. Plotting the outdegrees against the indegrees

 plot( jitter(rowSums(coleman[1,,]) ) , jitter( colSums(coleman[1,,]) ) ) 

reveals that some send more than they receive, and some send fewer than they receive.

One reason for this is that some pairs are asymmetric and other pairs are symmetric in their choices (there are of course other possible explanations; see Igarashi and Koskinen 2020)

Dyads and reciprocity

A dyad is a pair of nodes and their ties \((X_{i,j}X_{j,i})\) to each other. To investigate whether pairs are symmetric or asymmetric, consider e.g. that 1 nominated 14 but 14 did not nominate 1 back - this pair is asymmetric.

 coleman[1,1,14]
 coleman[1,14,1]

The dyad 19 and 4

 coleman[1,4,19]
 coleman[1,19,4]

is symmetric, or reciprocal - we call this a mutual dyad.

In general we can distinguish between dyads that are Mutual, Asymmetric, and Null (Holland and Leinhardt 1972):

par ( mfrow = c(1,3))
plot( as.network(coleman[1,c(4,19),c(4,19)] ),main='Mutual' ,label = c(4,19), usecurve = TRUE, edge.curv = 0.01, arrowhead.cex = 2.5, vertex.cex = degree(coleman[1,,])[c(4,19)] ) 
plot( as.network(coleman[1,c(1,14),c(1,14)] ),main='Asymmetric' ,label = c(1,14), arrowhead.cex = 2.5, vertex.cex = degree(coleman[1,,])[c(1,14)] ) 
plot( as.network(coleman[1,c(1,2),c(1,2)] ),main='Null' ,label = c(1,2) , vertex.cex = degree(coleman[1,,])[c(1,2)] ) 

Fun Fact: Linear algebra

Recall that the product of a matrix \(X\) with itself has entries \((XX)_{i,j}=\sum_{k}X_{i,k} X_{ k,j }\). Consequently, \(\mathrm{tr}(XX)\) gives you twice the number of reciprocated dyads. There is no standard command for the trace in R, but you can take the sum() of the diagonal diag() of the matrix product, e.g. for adjacency matrix A, sum( diag( A %*% A ) ). The operator * in R is the elementwise (Hadamard) product \(X \odot X^{\top}\), where \((X \odot X^{\top})_{i,j}=X_{i,j} X_{ j,i }\). Consequently, for adjacency matrix A, A * t(A) is the matrix of reciprocated ties.
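A quick check of both identities on a toy directed adjacency matrix (1 and 2 mutual, 1 → 3 unreciprocated):

```r
# Toy directed network: 1 <-> 2 is a mutual dyad, 1 -> 3 is unreciprocated
A <- matrix(0, 3, 3)
A[1, 2] <- 1; A[2, 1] <- 1; A[1, 3] <- 1

sum(diag(A %*% A)) / 2   # tr(XX)/2 = number of mutual dyads: 1
A * t(A)                 # elementwise product: only the reciprocated ties remain
```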

Dyad census

The dyad census tabulates the number of dyads that are Mutual, Asymmetric, and Null

 dyad.census(coleman[1,,])
##      Mut Asym Null
## [1,]  62  119 2447

Fun Fact: census is complete enumeration of dyads

Note that in the Coleman example, the total number of M, A, and N dyads sum up to

 62 + 119 +  2447

which is exactly equal to \(n(n-1)/2=73\times72/2=2628\) - this is the total number of (un-ordered) pairs, the number of cells in the upper triangle of the adjacency matrix.
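The count can be checked directly with R's choose():

```r
# The M, A, and N counts from the dyad census sum to choose(n, 2)
n <- 73
62 + 119 + 2447 == n * (n - 1) / 2   # TRUE
choose(n, 2)                          # 2628
```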

Triads

For triples in directed networks, there are many different kinds. For open triads, we could for example consider the three different types you can achieve with 0 Mutual dyads, 2 Asymmetric dyads, and 1 Null dyad.

par( mfrow = c(1,3))
plot( as.network( matrix(c(0,1,0,0,0,1,0,0,0),3,3 ,byrow=TRUE) ), coord = matrix(c(0,0,5,10,10,0) ,3,2,byrow=TRUE) ,arrowhead.cex = 2.5, vertex.cex =3 , jitter = FALSE, main='021C')
plot( as.network( matrix(c(0,1,0,0,0,0,0,1,0),3,3 ) ), coord = matrix(c(0,0,5,10,10,0) ,3,2,byrow=TRUE) ,arrowhead.cex = 2.5, vertex.cex =3 , jitter = FALSE, main='021D')
plot( as.network( matrix(c(0,0,0,1,0,1,0,0,0),3,3 ) ), coord = matrix(c(0,0,5,10,10,0) ,3,2,byrow=TRUE) ,arrowhead.cex = 2.5, vertex.cex =3 , jitter = FALSE,  main='021U')

The triples are labeled by the number of Mutual, Asymmetric, and Null dyads - the so-called MAN labeling scheme. Here, the ‘C’, ‘D’, and ‘U’ distinguish between ‘Cyclic’, ‘Down’, and ‘Up’.

For closed triads, consider the Transitive (‘T’), Cyclic (‘C’), and complete closed triads

par( mfrow = c(1,3))
plot( as.network( matrix(c(0,1,0,0,0,1,1,0,0),3,3 ,byrow=TRUE) ), coord = matrix(c(0,0,5,10,10,0) ,3,2,byrow=TRUE) ,arrowhead.cex = 2.5, vertex.cex =3 , jitter = FALSE, main='030C')
plot( as.network( matrix(c(0,1,1,0,0,0,0,1,0),3,3 ) ,byrow=TRUE), coord = matrix(c(0,0,5,10,10,0) ,3,2,byrow=TRUE) ,arrowhead.cex = 2.5, vertex.cex =3 , jitter = FALSE, main='030T')
plot( as.network( matrix(c(0,1,1,1,0,1,1,1,0),3,3 )  ,byrow=TRUE), coord = matrix(c(0,0,5,10,10,0) ,3,2,byrow=TRUE) ,arrowhead.cex = 2.5, vertex.cex =3 , jitter = FALSE,  main='300', usecurve = TRUE, edge.curv = 0.01)

When considering the potential interpretation of these closed triads, imagine scenarios where ties reflect ‘liking’, ‘look up to’, ‘give orders’, ‘passing on information’, and how these structures would be reflected in status, chain of command, access to information, etc.

Triad census

In total there are 16 types of triads, all of which are labelled using the MAN labeling scheme (Holland and Leinhardt 1972). For the Coleman data, we calculate the triad census as

 triad.census(coleman[1,,])
##        003  012  102 021D 021U 021C 111D 111U 030T 030C 201 120D 120U 120C 210
## [1,] 50171 7384 3957   64  121  128  139   70   23    1  20   43   10    9  34
##      300
## [1,]  22

Fun Fact: census is complete enumeration of triads

Note that in the Coleman example, the total number of triads is

sum(triad.census(coleman[1,,]))

which is exactly equal to \[\binom{n}{3}= \frac{n(n-1)(n-2)}{3\times 2 \times 1}=73\times72\times71/6=62196\] - this is the total number of (un-ordered) triples, the number of ways in which you can choose 3-element subsets out of a 73-element set.
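Again, choose() confirms the count:

```r
# The total number of triads among n = 73 nodes
n <- 73
choose(n, 3)                 # 62196
n * (n - 1) * (n - 2) / 6    # the same
```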

Plotting the network

Plotting the sociogram of the network does not differ from the undirected case, with the exception that the ties are now represented by arrows.

plot( as.network( coleman[1,,] , directed= TRUE), # the network object
      vertex.cex = degree(coleman[1,,], cmode = 'indegree')/5 + .2 )

The size of a node here is set proportional to the indegree centrality.

There are two large ‘clusters’ of nodes that are connected within but not between - these are two components.

Non-parametric

Using the data we downloaded and formatted in the Data-Formatting.Rmd vignette, we will here investigate how much these (undirected) networks differ from random networks. In other words, we ask what features of the networks are not guided by chance.

Make sure that you have loaded the required libraries, and also load ergm

library("ergm")

Load formatted data

load(url("https://raw.githubusercontent.com/johankoskinen/CHDH-SNA/main/data/undirected.RData"))
ls()#list all the objects you downloaded
##  [1] "ADJ"                    "advice.adj"             "attributes"            
##  [4] "cc"                     "cd"                     "CoKaMe"                
##  [7] "CoKaMeAdvice"           "CoKaMeAttributes"       "CoKaMeDisc"            
## [10] "CoKaMeFriend"           "coleman"                "coord"                 
## [13] "deg.dist"               "degree.dist"            "DistanceHospital"      
## [16] "EIES1"                  "EIES1.net"              "EIES2"                 
## [19] "EIES2.net"              "EIESATT"                "g"                     
## [22] "gd"                     "greekBomb"              "greekBomb.net"         
## [25] "kapf"                   "KAPFTS1"                "KAPFTS1.net"           
## [28] "KAPFTS2"                "KAPFTS2.net"            "lgc"                   
## [31] "like3bin.net"           "Liking1"                "Liking2"               
## [34] "Liking3"                "Liking3Bin"             "max.deg"               
## [37] "n"                      "N"                      "net"                   
## [40] "net.ave.dist"           "net.centralisation"     "net.closed.triad"      
## [43] "net.components"         "net.dens"               "net.large.comp"        
## [46] "net.open.triad"         "net.profiles"           "net.size"              
## [49] "net.ties"               "netnames"               "Noordin"               
## [52] "Noordin.net"            "padgbus.net"            "padgetbusines"         
## [55] "padgettbus"             "padgettmar"             "padgmar.net"           
## [58] "RDCON"                  "RDCON.net"              "RDGAM"                 
## [61] "RDGAM.net"              "RDNEG"                  "RDNEG.net"             
## [64] "RDPOS"                  "RDPOS.net"              "row.index"             
## [67] "sage.net"               "SagemanAdjacencyMatrix" "SagemanAtt"            
## [70] "sampson"                "SizeHospital"           "temp"                  
## [73] "TransBin"               "TransBin.net"           "tribes.all"            
## [76] "tribesNeg"              "tribesNeg.net"          "tribesPos"             
## [79] "tribesPos.net"          "TypeHospital"           "use"                   
## [82] "valuedTransfer"         "vertex.sides"           "Webster"               
## [85] "Webster.net"            "WebsterAtt"             "wiring"                
## [88] "Wiring"                 "WTN"                    "WTN.net"               
## [91] "WTNmatrix"              "Zachary"                "ZacharyBinary"         
## [94] "ZacharyBinary.net"      "ZacharyValued"

The object net.profiles contains a number of formatted networks along with some network summary measures

names(net.profiles)
##  [1] "netnames"           "net.size"           "net.ties"          
##  [4] "net.dens"           "net.open.triad"     "net.closed.triad"  
##  [7] "net.centralisation" "net.components"     "net.large.comp"    
## [10] "net.ave.dist"

Some notation

We let the set of \(n\) nodes be \(V = \{1,\ldots,n \}\), and we define the symmetric adjacency matrix \(X=(X_{ij})\), where the set of non-redundant pairs is \(\mathcal{N} = { V \choose 2}\), with elements

\[ X_{ij}= \left\{ \begin{array}{lr} 1,&\text{if there is a tie between } i \text{ and } j, i,j \in V\\ 0,&\text{else} \end{array} \right. {\text{,}} \]

Completely random graph

Assume that, independently for each pair \(\{i,j\}\), we decide whether the tie exists by tossing a fair coin. This means that, independently for all \(\{i,j\} \in \mathcal{N}\), \[ \Pr (X_{ij}=1) = 0.5 \] In sna, the function rgraph generates random graphs.

Padgett’s business network

Generate an \(n=16\) completely random graph and compare it to the Padgett and Ansell (1993) Florentine families business network

Generate random network

We will generate a network that has only \(n\) in common with the Padgett data set.

row.index <- which(net.profiles$netnames=='PadgetB')# the row in the matrix containing Padgett summaries
n <- net.profiles$net.size[ row.index ]
g <- rgraph(n,#size of the network
            m=1,# number of networks to generate
            tprob=0.5,# tie probability
            mode="graph") # undirected (graph) or directed (digraph)
g.net <- as.network(g,directed=FALSE)# 'sna' gives you a matrix
gden( padgbus.net )# observed density
gden(g.net)# density 

Now we can compare what a network would look like if ties were formed completely at random with our observed network.

Plot and compare

par(mfrow=c(1,2))
plot(padgbus.net)
plot(g.net)

Does it look more ‘random’ than the Padgett network?

This is only one random network, maybe it looks different by chance. Generate \(m = 100\) networks

row.index <- which(net.profiles$netnames=='PadgetB')# the row in the matrix containing Padgett summaries
n <- net.profiles$net.size[ row.index ]
g100 <- rgraph(n,#size of the network
            m=100,# number of networks to generate
            tprob=0.5,# tie probability
            mode="graph") # undirected (graph) or directed (digraph)

Calculate metrics

Any metric that we calculate for the observed network we can calculate for a random network

par( mfrow  = c(1,3) )
hist( gden( g100) , # the density for each of the 100 random networks
      xlim=range( gden( g100 ), 
                  gden( padgbus.net ) ) , main='density') 
abline( v=gden( padgbus.net ),col='red' )# the observed density for reference

row.index <- which(net.profiles$netnames=='PadgetB')# the row in the matrix containing Padgett summaries
degrees <-degree(g100,g=c(1:100), cmode='indegree')# you can calculate the degree distributions for all graphs
deg.sist <- cbind( matrix( colSums(degrees==0),100,1),
                  t(apply(degrees,2,function(x) tabulate(x, nbins=max.deg) ) ) )# when tabulating we need to add the number of isolates 
matplot(c(0:(max.deg)), t(deg.sist) ,
        type ='l',
        main='degree distribution' )
lines(c(0:(max.deg)),degree.dist[row.index,c(2:(max.deg+2))],col='grey',lwd=3)

triads <- triad.census( g100 , mode='graph')[,4]# the number of closed triads
hist(triads ,
     xlim= range( triads , net.profiles$net.closed.triad[ row.index ] ),
     main = 'closed triads')
abline( v=net.profiles$net.closed.triad[ row.index ],col='red' )# the observed density for reference

Clearly, since the random networks are too dense, the degree distributions are also different from the observed network, with nodes on average having a higher degree. The random networks also produce many triangles - but is that simply because they have so many more ties?

Fixed density: implications

Will a random network with density \(0.5\) resemble any of the observed networks?

Plot the densities for the observed networks as a function of size

plot( net.profiles$net.size[order(net.profiles$net.size)],
      net.profiles$net.dens[order(net.profiles$net.size)],
      type='b',
      xlab = 'network size',ylab = 'density',ylim=range(net.profiles$net.dens,.55))
abline(h=0.5, col='red')# reference line for the completely random graphs

Fixing the density at 0.5 irrespective of network size gives unrealistically dense networks

Density conditioned graph

If we do not get the density right, we won’t get anything right. We introduce a random graph model that fixes the density.

Definition

Let \(L(x) = \sum_{i<j}x_{ij}\) be the number of edges in a graph. The number of graphs with exactly \(k\) edges \[ \mathcal{X}_k= \{ x \in \mathcal{X} : L(x)=k \} \] is given by \({ M \choose k}\), where \(M= { n \choose 2}\), and consequently the conditional distribution \[ \Pr(X=x \mid L(x)= k ) = \frac{1}{{ M \choose k}} \]

Note that we are no longer deciding independently for each pair if there is a tie or not; rather, we select \(k\) pairs at random from the total set of pairs and assign ties to them.

We refer to this model as \(X \thicksim U \mid L(X)= k\).
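The definition can be sketched directly in base R by drawing \(k\) of the \(M\) pairs uniformly at random (runif.given.L is an illustrative name, not a standard function):

```r
# Sketch of X ~ U | L(X) = k: pick k of the M = choose(n, 2) un-ordered
# pairs uniformly at random and set those tie-variables to 1.
runif.given.L <- function(n, k){
  pairs <- t(combn(n, 2))                           # all M un-ordered pairs
  chosen <- pairs[sample(nrow(pairs), k), , drop = FALSE]
  x <- matrix(0, n, n)
  x[chosen] <- 1
  x + t(x)                                          # symmetrise
}

set.seed(1)
x <- runif.given.L(16, 15)
sum(x) / 2   # always exactly k = 15 edges
```

Every realisation has exactly \(k\) edges, unlike the coin-tossing model where the edge count varies.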

In sna, the function rgnm generates random graphs with exactly the same number of ties as the observed network (same density)

Florentine families

Simulate graphs

row.index <- which(net.profiles$netnames=='PadgetB')# the row in the matrix containing Padgett summaries
 
cg100 <- rgnm(n =100,# number of networks to generate
            nv=net.profiles$net.size[ row.index ],#size of the network
            m = net.profiles$net.ties[ row.index ],# number of edges
            mode="graph") # undirected (graph) or directed (digraph)

Plot random

Plot the observed network and one of the simulated networks

par(mfrow=c(1,2))
plot(padgbus.net)
plot(as.network(cg100[1,,],directed=FALSE ))

Does the random graph look more like the empirical graph?

Calculate metrics on random networks

Any metric that we calculate for the observed network we can calculate for a random network. Calculate density, degree distribution, and triangles for all 100 networks and plot and compare

par( mfrow  = c(1,3) )
hist( gden( cg100) , # the density for each of the 100 random networks
      xlim=range( gden( cg100 ), 
                  gden( padgbus.net ) ) , main='density') 
abline( v=gden( padgbus.net ),col='red' )# the observed density for reference

row.index <- which(net.profiles$netnames=='PadgetB')# the row in the matrix containing Padgett summaries
degrees <-degree(cg100,g=c(1:100), cmode='indegree')# you can calculate the degree distributions for all graphs
deg.sist <- cbind( matrix( colSums(degrees==0),100,1),
                  t(apply(degrees,2,function(x) tabulate(x, nbins=max.deg) ) ) )# when tabulating we need to add the number of isolates 
matplot(c(0:(max.deg)), t(deg.sist) ,
        type ='l',
        main='degree distribution' )
lines(c(0:(max.deg)),degree.dist[row.index,c(2:(max.deg+2))],col='grey',lwd=3)

triads <- triad.census( cg100 , mode='graph' )[,4]
hist(triads ,
     xlim= range( triads , net.profiles$net.closed.triad[ row.index ] ),
     main = 'closed triads')
abline( v=net.profiles$net.closed.triad[ row.index ],col='red' )# the observed density for reference

The density is constant but

The degree distributions have a completely different shape to the observed (grey), and few of the simulated networks have many triangles

Other example

netnames
##  [1] "PadgetB"   "PadgetM"   "TribesPos" "TribesNeg" "WireGam"   "WireCon"  
##  [7] "WirePos"   "WireNeg"   "Karate"    "Sageman"   "Greek"     "Webster"  
## [13] "Noordin"   "KapfS1"    "KapfS2"    "EIES1"     "EIES2"
# Pick one
net <- tribesPos.net# to simply rename temporarily
row.index <- which(net.profiles$netnames=='TribesPos')# the row in the matrix containing TribesPos summaries

Simulate graphs

cg100 <- rgnm(n =100,# number of networks to generate
            nv=net.profiles$net.size[ row.index ],#size of the network
            m = net.profiles$net.ties[ row.index ],# number of edges
            mode="graph") # undirected (graph) or directed (digraph)

Plot random

Plot the observed network and one of the simulated networks

par(mfrow=c(1,2))
plot(net)
plot(as.network(cg100[1,,],directed=FALSE ))

Calculate metrics

Calculate density, degree distribution, and triangles for all 100 networks and plot and compare

par( mfrow  = c(1,3) )
hist( gden( cg100) , # the density for each of the 100 random networks
      xlim=range( gden( cg100 ), 
                  gden( net ) ) , main='density') 
abline( v=gden( net ),col='red' )# the observed density for reference

degrees <-degree(cg100,g=c(1:100), cmode='indegree')# you can calculate the degree distributions for all graphs
deg.sist <- cbind( matrix( colSums(degrees==0),100,1),
                  t(apply(degrees,2,function(x) tabulate(x, nbins=max.deg) ) ) )# when tabulating we need to add the number of isolates 
matplot(c(0:(max.deg)), t(deg.sist) ,
        type ='l',
        main='degree distribution' )
lines(c(0:(max.deg)),degree.dist[row.index,c(2:(max.deg+2))],col='grey',lwd=3)

triads <- triad.census( cg100 , mode='graph' )[,4]
hist(triads ,
     xlim= range( triads , net.profiles$net.closed.triad[ row.index ] ),
     main = 'closed triads')
abline( v=net.profiles$net.closed.triad[ row.index ],col='red' )# the observed density for reference

For all graphs

The conditional uniform \(X \thicksim U \mid L(X)= k\) distribution does not seem to do a good job of reproducing clustering. Let us check for all of our networks prepared in Data-Formatting.Rmd:

par(mfrow=c(1,3))
BigT <- apply(net.profiles[,c(2,3)],1,function(x) {
  triad.census( rgnm(n =30,# number of networks to generate
            nv=x[1],#size of the network
            m = x[2],# number of edges
            mode="graph") # undirected (graph) or directed (digraph)
  , mode="graph")[,4]# important 'graph' giving the 4 triads
  })



boxplot(BigT[,order(net.profiles$net.size)],# the raw triad counts ordered by network size
        ylim=range(BigT,net.profiles$net.closed.triad),
        main='triangles',xlab='network',ylab='raw counts')
lines(net.profiles$net.closed.triad[ order(net.profiles$net.size) ],type='b',col='red')

ave.T <- colMeans(BigT)
BigT.norm <- apply(BigT,2, function(x) x/mean(x))# normalize

boxplot(BigT.norm[,order(net.profiles$net.size)],# the normalised triad counts ordered by network size
        ylim=range(BigT.norm,net.profiles$net.closed.triad/ave.T ),
        main='triangles',xlab='network',ylab='normalised counts')
lines(net.profiles$net.closed.triad[order(net.profiles$net.size)]/ave.T[order(net.profiles$net.size)] ,type='b',col='red')




BigC <- apply( net.profiles[,c(2,3)],1,function(x) {
  centralization( rgnm(n =30,# number of networks to generate
            nv=x[1],#size of the network
            m = x[2],# number of edges
            mode="graph") # undirected (graph) or directed (digraph)
  , degree, mode="graph")
  })
boxplot(BigC[,order(net.profiles$net.size)],# the centralisation scores ordered by network size
        ylim=range(BigC,net.profiles$net.centralisation),ylab='centralisation',xlab='network',main='Centralization')
lines(net.profiles$net.centralisation[order(net.profiles$net.size)],# add observed value for reference
      type='b',col='red')

Clearly, density is not the only thing that matters in tie-formation, because none of the simulated networks produce enough triangles.

Erdős–Rényi

Definition

The Bernoulli graph assumes that independently for each dyad \(\{i,j\} \in \mathcal{N}\), a tie is formed with probability \(p_{ij} \in (0,1)\) \[ \Pr(X_{ij}=1 ) = p_{ij} \] It is a homogeneous Bernoulli graph, or Erdős–Rényi graph, if \(p_{ij}=p\) for all \(\{i,j\} \in \mathcal{N}\). We say that \[ X \thicksim Bern(p) \] for a homogeneous Bernoulli model with tie-probability \(p\).

A particular form of Bernoulli graph is the stochastic blockmodel, where the tie-probability for \(\{i,j\}\) depends on the class memberships of the nodes \(i\) and \(j\).
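The definition translates directly into base R: one independent coin flip per dyad. A minimal sketch (no network packages; the numbers are illustrative, not taken from any of the datasets):

```r
# Simulate one homogeneous Bernoulli (Erdos-Renyi) graph as an
# adjacency matrix: one independent Bernoulli draw per dyad.
set.seed(1)
n <- 16
p <- 0.125
x <- matrix(0, n, n)
x[upper.tri(x)] <- rbinom(n * (n - 1) / 2, size = 1, prob = p)
x <- x + t(x)           # symmetrise to get an undirected graph
sum(x) / (n * (n - 1))  # realised density; close to p on average
```

Unlike the conditional uniform model, the number of edges here varies from draw to draw.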

Padgett’s business network

Assuming that \(X \thicksim U \mid L(X)= k\) did not produce enough triangles and gave the wrong shape for the degree distribution. Assuming \(X \thicksim Bern(p)\) instead, the average density of graphs will be equal to that of the observed network \(x_{\mathrm{obs}}\) if \[ p= \frac{L(x_{\mathrm{obs}})}{M} \] but the variance in \(L(X)\) will not be zero, as it is in \(X \thicksim U \mid L(X)= k\).

Will introducing more variability produce more triangles and more skewed degree distributions?

Simulate graphs

row.index <- which(net.profiles$netnames=='PadgetB')# the row in the matrix containing Padgett summaries
ber100 <- rgraph(net.profiles$net.size[ row.index ],#size of the network
            m=100,# number of networks to generate
            tprob=net.profiles$net.dens[ row.index ],# tie probability equal to observed density
            mode="graph") # undirected (graph) or directed (digraph)

Plot random

Plot the observed network and one of the simulated networks

net <- padgbus.net# to simply rename temporarily
par(mfrow=c(1,2))
plot(net)
plot(as.network(ber100[1,,],directed=FALSE ))

gden(net)# observed density
## [1] 0.125
gden(ber100[1,,])# density of a simulated network
## [1] 0.075

Does the random graph look more like the empirical graph?

Calculate metrics on random networks

Any metric that we calculate for the observed network we can calculate for a random network. Calculate density, degree distribution, and triangles for all 100 networks and plot and compare

par( mfrow  = c(1,3) )
hist( gden( ber100 ) , # the density for each of the 100 random networks
      xlim=range( gden( ber100 ), 
                  gden( net ) ) , main='density') 
abline( v=gden( net ),col='red' )# the observed density for reference


degrees <-degree(ber100,g=c(1:100), cmode='indegree')# you can calculate the degree distributions for all graphs
deg.sist <- cbind( matrix( colSums(degrees==0),100,1),
                  t(apply(degrees,2,function(x) tabulate(x, nbins=max.deg) ) ) )# when tabulating we need to add the number of isolates 
matplot(c(0:(max.deg)), t(deg.sist) ,
        type ='l',
        main='degree distribution' )
lines(c(0:(max.deg)),degree.dist[row.index,c(2:(max.deg+2))],col='grey',lwd=3)

triads <- triad.census( ber100 , mode='graph' )[,4]
hist(triads ,
     xlim= range( triads , net.profiles$net.closed.triad[ row.index ] ),
     main = 'closed triads')
abline( v=net.profiles$net.closed.triad[ row.index ],col='red' )# the observed count for reference

The degree distributions are closer (though all still have the wrong shape), but there are still too few triangles.

To get an idea of just how unlikely it is that a Bernoulli graph could have produced \(5\) or more triangles (as in Padgett’s data), compute

mean( triads >= 5 )
## [1] 0.02
distances <- apply(ber100,
                          1, 
                          function(x) tabulate( geodist( x )$gdist[upper.tri(geodist( x )$gdist)] , nbins=10 ) )  
                    
 
matplot(c(1:(10)), distances ,
        type ='l',
        main='geodesic distribution' ,
        xlab='distance',
        ylab='frequency')
lines(c(1:(10)),
      tabulate( geodist( net )$gdist[upper.tri(geodist( net )$gdist)] , nbins=10 ),
      pch=24,bg="blue",lty=1,type='b',
      col='red',lwd=3 )

The model \(X \thicksim Bern(p)\) does well in capturing the distances: Bernoulli graphs have short path lengths. It achieves this by sacrificing clustering, though. If there is a choice between ‘linking to someone new’ or ‘linking back to someone you indirectly know’, it chooses the new one. Because ties form independently, the Bernoulli tie-probability does not know whom a node already has ties to.

Bernoulli graphs do not waste ties on clustering - this creates reach

Interpreting this in terms of tie-formation processes:

Independence means that a tie is more likely to go to a ‘new’ person than to ‘a friend of a friend’, because the Bernoulli model does not know (independence) who your friends are

Clearly, we cannot assume that people form ties independently of each other!

For all graphs

Let us see how \(X \thicksim Bern(p)\) does for all networks.

par(mfrow=c(1,3))
BigT <- apply(net.profiles[,c(2,4)],1,function(x) {
  triad.census( rgraph(n = x[1],
                       m =30,# number of networks to generate
                       tprob= x[2], # density of network
                       mode="graph") # undirected (graph) or directed (digraph)
  , mode="graph")[,4]# important 'graph' giving the 4 triads
  })



boxplot(BigT[,order(net.profiles$net.size)],# the raw triad counts ordered by network size
        ylim=range(BigT,net.profiles$net.closed.triad),
        main='triangles',xlab='network',ylab='raw counts')
lines(net.profiles$net.closed.triad[ order(net.profiles$net.size) ],type='b',col='red')

ave.T <- colMeans(BigT)
BigT.norm <- apply(BigT,2, function(x) x/mean(x))# normalize

boxplot(BigT.norm[,order(net.profiles$net.size)],# the normalised triad counts ordered by network size
        ylim=range(BigT.norm,net.profiles$net.closed.triad/ave.T ),
        main='triangles',xlab='network',ylab='normalised counts')
lines(net.profiles$net.closed.triad[order(net.profiles$net.size)]/ave.T[order(net.profiles$net.size)] ,type='b',col='red')




BigC <- apply(net.profiles[,c(2,4)],1,function(x) {
  centralization( rgraph(n = x[1],
                       m =30,# number of networks to generate
                       tprob= x[2], # density of network
                       mode="graph")  # undirected (graph) or directed (digraph)
  , degree, mode="graph")
  })
boxplot(BigC[,order(net.profiles$net.size)],# the centralisation scores ordered by network size
        ylim=range(BigC,net.profiles$net.centralisation),ylab='centralisation',xlab='network',main='Centralization')
lines(net.profiles$net.centralisation[order(net.profiles$net.size)],# add observed value for reference
      type='b',col='red')

The extra variability offered by \(Bern(p)\) does a little to improve the features of the simulated networks but much of the clustering and centralisation remains unexplained.

Conditional degree distribution

It is clear that \(Bern(p)\) does not capture the degree distribution of the typical observed network. This is not surprising given that the tie-probability is the same for all dyads, and thereby for all nodes: on average, all nodes have the same degree.

To see how much of the structure of a network we can explain through the degree distribution we introduce yet another model.

Definition

Like the \(U \mid L(X)=k\) distribution, we define a distribution for graphs (not for tie-variables). We let \(X \thicksim U \mid X_{\cdot+}=d\) mean that the distribution is uniform on all the graphs that have exactly the degree distribution \(d\). Thus \[ \Pr( X = x) = \left\{ \begin{array}{lr} c^{-1},&\text{if } x_{\cdot+}=d\\ 0,&\text{else} \end{array} \right. {\text{,}} \] where \(x_{\cdot+}=(\sum_j x_{1j},\ldots,\sum_j x_{nj})^{\top}\) is the vector of degrees, and \(c = \sharp \{ x \in \mathcal{X}: x_{\cdot+}=d \}\).

There are several algorithms for simulating networks with a fixed degree distribution, and all of the reliable ones rely on Markov chain Monte Carlo (MCMC).

Note on MCMC

The premise of MCMC is that you want to draw a variable \(x\) from a distribution \(p(x)\) but cannot do so directly. What you can do, though, is say how much more likely one realisation is relative to another. For example, let \(x\) and \(y\) be two networks and let the target distribution be \(p(x)\). What we need to know in order to use MCMC is the ratio \(p(x)/p(y)\) for all \(x,y \in \mathcal{X}\). This allows us to generate a sequence of realisations \[ x_0,x_1,x_2,\ldots,x_K \] of outcomes. This is an iterative scheme with an updating rule that takes an outcome \(x_t\) and updates it to \(x_{t+1}\), and this update relies on our ability to calculate \(p(x)/p(y)\) for all \(x,y \in \mathcal{X}\). In particular, if we are in state \(x_t\) and consider moving to state \(x^{\ast}\), we know how much more or less likely the new state is relative to the old state, \(p(x^{\ast})/p(x_t)\). For networks, these updates are often incremental, and you either set \(x_{t+1}:=x^{\ast}\) or \(x_{t+1}:=x_{t}\), depending on \(p(x^{\ast})/p(x_t)\). Consequently, we sometimes stay in the same state, in the same network, for many iterations.

Burnin: to make sure that \(x_t\) has ‘forgotten’ the initial state \(x_0\), you usually allow a number of iterations to pass first - this is the burnin period.

Thinning: if you have an adequate burnin and want to draw many outcomes, you do not have to repeat the burnin for each draw; instead you allow a certain number of iterations to pass between successive sample points - the number of iterations you wait is called the thinning.
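As a sketch of the kind of incremental update used for \(U \mid X_{\cdot+}=d\), here is a base-R ‘double edge swap’ (a hypothetical helper for illustration, not the routine ergm uses internally): it rewires two edges at a time, which preserves every node’s degree, and rejects proposals that would create loops or duplicate edges, in which case we stay in the same state.

```r
# One degree-preserving MCMC step: replace edges {a,b},{c,d}
# with {a,d},{c,b}. Assumes an undirected 0/1 adjacency matrix
# with at least two edges.
swap.step <- function(x) {
  e <- which(upper.tri(x) & x == 1, arr.ind = TRUE)  # list of edges (i < j)
  picks <- e[sample(nrow(e), 2), ]                   # two random edges
  a <- picks[1, 1]; b <- picks[1, 2]
  c <- picks[2, 1]; d <- picks[2, 2]
  # reject swaps that would create loops or multi-edges
  if (length(unique(c(a, b, c, d))) < 4) return(x)
  if (x[a, d] == 1 || x[c, b] == 1) return(x)
  x[a, b] <- x[b, a] <- 0; x[c, d] <- x[d, c] <- 0
  x[a, d] <- x[d, a] <- 1; x[c, b] <- x[b, c] <- 1
  x
}
```

Applying many such steps (the burnin) gives an approximately uniform draw from the graphs with the same degree vector as the starting network.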

Padgett’s business network

Simulate graphs

g.udegs <- simulate(padgbus.net~edges,
                    coef=c(0),
                    constraints=~degreedist,# this is what does it: the degree distribution is taken from padgbus.net
                    nsim=100,
                    control=control.simulate(MCMC.burnin=100000))# you need to bump up the default burnin, otherwise the networks are too similar to the starting point, the observed network

To confirm that this network actually has the same degree distribution:

par(mfrow=c(1,3))
deg1 <- degree(padgbus.net, cmode='indegree')
deg2 <- degree(g.udegs[[5]], cmode='indegree')# this is a list of networks, not an array, hence [[k]]
plot( deg1,deg2 ,xlab='Padgett', ylab ='random network',main='degrees')#
plot( deg1[order(deg1)],deg2[order(deg2)] ,ylab ='random network',main='degrees')# the degree distribution is the same but not the same nodes have the same degrees
plot( table(deg1), type='l',col='grey',lwd=4)
lines( table(deg2) ,col='red' )

From the first panel we see that not every node has the same degree in the random network as in the observed network. However, the degree distribution is the same, as seen in the right-hand panel. The middle panel shows that there is a mapping from the nodes in \(x_{\mathrm{obs}}\) to the nodes in \(x\), such that for every node with degree \(k\) in the former, there is exactly one node in the latter that has degree \(k\).

Plot random

Plot the observed network and then three of the random networks

par( mfrow =c(1,4) )
plot( padgbus.net )
plot( g.udegs[[5]] )
plot( g.udegs[[50]] )
plot( g.udegs[[100]] )

The simulated networks look like pretty faithful representations of the observed network.

Since the degree distribution is identical to the observed network, the density will also be preserved for all simulated networks. Let us look at triangles and geodesic distances.

par( mfrow  = c(1,3) )
row.index <- which(net.profiles$netnames=='PadgetB')# the row containing the Padgett summaries

triads <- triad.census( g.udegs , mode="graph")[,4]
hist(triads ,
     xlim= range( triads , net.profiles$net.closed.triad[ row.index ] ),
     main = 'closed triads')
abline( v=net.profiles$net.closed.triad[ row.index ],col='red' )# the observed count for reference

triads <- triad.census( g.udegs ,mode="graph" )[,3]
hist(triads ,
     xlim= range( triads , net.profiles$net.open.triad[ row.index ] ),
     main = 'open triads')
abline( v=net.profiles$net.open.triad[ row.index ],col='red' )# the observed count for reference

# because g.udegs is a list it is a little more complicated to use built in sna functions
distances <- matrix(0,10,100)
for (k in c(1:100))
{
  distances[,k] <- tabulate( geodist( g.udegs[[k]] )$gdist[upper.tri(geodist( g.udegs[[k]] )$gdist)], 
                             nbins=10 )  
}

matplot(c(1:(10)), distances ,
        type ='l',
        main='geodesic distribution' ,
        xlab='distance',
        ylab='frequency')
lines(c(1:(10)),
      tabulate( geodist( padgbus.net )$gdist[upper.tri(geodist( padgbus.net )$gdist)] , nbins=10 ),
      pch=24,bg="blue",lty=1,type='b',
      col='red',lwd=3 )

The \(X \thicksim U \mid X_{\cdot+}=d\) model does not fully capture clustering and it over-produces open triangles.

Even if we could model the degree of every node exactly, we cannot produce the same clustering that we see for real networks

The degree distribution alone is not sufficient for explaining clustering and reach - people do not only care about how many friends they have

ERGM

Definition

An introduction to ERGMs is given in the slides. Formally, an ERGM is a model \[ \Pr( X = x \mid \theta) = \frac{ \exp \{ \theta^{\top}z(x) \} }{ \sum_{x \in \mathcal{X} } \exp \{ \theta^{\top}z(x) \} } \]

where

  • \(\theta\) is a vector of statistical parameters,
  • \(z(x)\) is a vector of statistics calculated on \(x\), and
  • the denominator is a sum over all graphs, ensuring that the probabilities sum to one.

Example

If \(z(x)=L(x)\), the only model statistic is the number of edges in the graph. This model is equivalent to \(Bern(p)\), where \[ \theta = \log\left( \frac{p}{1-p}\right) \] or, expressed in terms of \(p\), \[ p= \frac{e^{\theta}}{1+e^{\theta}} \]
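A quick numerical check of this correspondence in plain R:

```r
# The edges-only ERGM parameter is the log-odds of a tie.
p <- 0.125
theta <- log(p / (1 - p))                # theta from p
p.back <- exp(theta) / (1 + exp(theta))  # p recovered from theta
all.equal(p, p.back)                     # TRUE: the parameterisations agree
```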

Properties

A property of ERGMs is that you can set the values of the expected statistics. The expected statistics \[ E_{\theta} \{ z(X) \} = \sum_{x \in \mathcal{X}} z(x)\Pr( X = x \mid \theta) \] are completely determined by the parameters.
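For the edges-only model this can be verified by brute force on a tiny graph. The sketch below enumerates all \(2^3\) graphs on \(n=3\) nodes (\(M=3\) dyads) and checks that \(E_{\theta}\{L(X)\} = Mp\):

```r
# Enumerate all graphs on 3 nodes via their 3 tie-variables and
# compute the expected edge count under the edges-only ERGM.
theta <- -1.945
M <- 3
L <- rowSums(expand.grid(0:1, 0:1, 0:1))  # edge count of each of the 8 graphs
w <- exp(theta * L)                       # unnormalised ERGM weights
EL <- sum(L * w) / sum(w)                 # E_theta{ L(X) }
p <- exp(theta) / (1 + exp(theta))
all.equal(EL, M * p)                      # TRUE: expected edges equal M * p
```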

Padgett’s business network

Bernoulli model

Simulate networks from an ERGM with only the statistic \(z(x)=L(x)\) and parameter \(\theta = -1.945\)

g.ergm.bern <- simulate(padgbus.net~edges,# one statistic
                    coef= -1.945,# one parameter
                    nsim=100,
                    control=control.simulate(MCMC.burnin=100000))# you need to bump up the default burnin, otherwise the networks are too similar to the starting point, the observed network

Inspect networks

par( mfrow  = c(1,2) )
hist( gden( g.ergm.bern ) , # the density for each of the 100 random networks
      xlim=range( gden( g.ergm.bern ), 
                  gden( padgbus.net ) ) , main='density') 
abline( v=gden( padgbus.net ),col='red' )# the observed density for reference


triads <- triad.census( g.ergm.bern , mode='graph' )[,4]
hist(triads ,
     xlim= range( triads , triad.census( padgbus.net , mode='graph' )[,4] ),
     main = 'closed triads')
abline( v= triad.census( padgbus.net , mode='graph' )[,4],col='red' )# the observed count for reference

The density of the graphs is centred right over the observed value! Because this model is equivalent to \(Bern(p)\), the misfit of closed triads is the same as before.

The Bernoulli ERGM assumes that ties form independently - how can we relax that?

Markov model

Since the Bernoulli ERGM does not reproduce the number of triangles, how can we improve on that?

If we add statistics for the number of triangles we can increase the number of triangles that the model produces. Let us add statistics:

  • edges: \(\sum_{ i <j }x_{ij}\)
  • two-stars: \(\sum_{ i ,j,k }x_{ij}x_{ik}\)
  • three-stars: \(\sum_{ i ,j,k,h }x_{ij}x_{ik}x_{ih}\)
  • triangles: \(\sum_{ i ,j,k}x_{ij}x_{ik}x_{jk}\)

We will use the parameters \(\theta = \left( -4.4878, 1.2556, -0.7059, 1.0266 \right)^{\top}\)

g.ergm.markov <- simulate(padgbus.net~edges+kstar(2) +kstar(3) + triangles,# four statistics
                    coef= c(-4.4878  ,  1.2556 ,  -0.7059  ,  1.0266),# four parameters
                    nsim=100,
                    control=control.simulate(MCMC.burnin=100000))# you need to bump up the default burnin, otherwise the networks are too similar to the starting point, the observed network

Inspect networks

par( mfrow  = c(2,2) )
hist( gden( g.ergm.markov ) , # the density for each of the 100 random networks
      xlim=range( gden( g.ergm.markov ), 
                  gden( padgbus.net ) ) , main='density') 
abline( v=gden( padgbus.net ),col='red' )# the observed density for reference


triads <- triad.census( g.ergm.markov , mode='graph' )
obsetriad <- triad.census(padgbus.net, mode='graph' )

hist(triads[,2] ,
     xlim= range( triads[,2] , obsetriad[2] ),
     main = 'single edge triads')
abline( v= obsetriad[2],col='red' )# the observed count for reference

hist(triads[,3] ,
     xlim= range( triads[,3] , obsetriad[3] ),
     main = 'open triads')
abline( v= obsetriad[3],col='red' )# the observed count for reference

hist(triads[,4] ,
     xlim= range( triads[,4] , obsetriad[4] ),
     main = 'closed triads')
abline( v= obsetriad[4],col='red' )# the observed count for reference

The Markov model manages to reproduce all of these local, structural statistics.

Let us look at a global property, like geodesic distances

row.index <- which(net.profiles$netnames=='PadgetB')# the row containing the Padgett summaries

# because g.ergm.markov is a list it is a little more complicated to use built-in sna functions
distances <- matrix(0,10,100)
for (k in c(1:100))
{
  distances[,k] <- tabulate( geodist( g.ergm.markov[[k]] )$gdist[upper.tri(geodist( g.ergm.markov[[k]] )$gdist)], 
                             nbins=10 )  
}

matplot(c(1:(10)), distances ,
        type ='l',
        main='geodesic distribution' ,
        xlab='distance',
        ylab='frequency')
lines(c(1:(10)),
      tabulate( geodist( padgbus.net )$gdist[upper.tri(geodist( padgbus.net )$gdist)] , nbins=10 ),
      pch=24,bg="blue",lty=1,type='b',
      col='red',lwd=3 )

The Markov model does a pretty spectacular job of also capturing the reach in the network.

The Markov model with four parameters is sufficient to explain density, local clustering, as well as connectivity.

We can interpret this as follows: the degree-based effects (the edges and stars) capture mechanisms to do with popularity and activity, while the triangle parameter captures triadic closure: friends of my friends are also my friends. There are many behavioural interpretations of these mechanisms, but at the end of the day they are sufficient for explaining a lot of the network structure.

Kapferer’s tailors

Let us try an ERGM for Kapferer’s tailors. This time, we use the statistics edges and so-called alternating triangles. Alternating triangles, or GWESP (geometrically weighted edgewise shared partners), counts the shared partners of directly tied pairs: each additional shared partner adds to the statistic, but with a geometrically decreasing weight.
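To see what ‘geometrically decreasing weight’ means, here is a toy base-R calculation on a five-edge graph. The weight formula below is one common parameterisation of the GWESP statistic; the exact form used by ergm’s gwesp term should be taken from its documentation.

```r
# Edgewise shared partners: for a tied pair {i,j}, sp[i,j] counts
# the common neighbours k with x[i,k] = x[j,k] = 1.
decay <- log(2)
x <- matrix(0, 4, 4)
edges <- rbind(c(1,2), c(1,3), c(2,3), c(2,4), c(3,4))
for (r in seq_len(nrow(edges))) {
  x[edges[r,1], edges[r,2]] <- 1
  x[edges[r,2], edges[r,1]] <- 1
}
sp <- (x %*% x) * x  # shared-partner counts, zeroed out for non-edges
# Each extra shared partner contributes less and less:
w <- exp(decay) * (1 - (1 - exp(-decay))^sp)
gwesp.stat <- sum(w[upper.tri(w) & x == 1])
gwesp.stat  # 5.5 for this toy graph
```

With decay = log(2), an edge with one shared partner contributes 1 and an edge with two shared partners contributes 1.5, not 2: the second partner is down-weighted.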

g.ergm.gwesp <- simulate(KAPFTS1.net~edges+gwesp(decay = log(2) , fixed= TRUE),# two statistics
                    coef= c(-3.795, 1.075),# the two parameters
                    nsim=100,
                    control=control.simulate(MCMC.burnin=100000))# you need to bump up the default burnin, otherwise the networks are too similar to the starting point, the observed network

Inspect networks

par( mfrow  = c(2,2) )
hist( gden( g.ergm.gwesp ) , # the density for each of the 100 random networks
      xlim=range( gden( g.ergm.gwesp ), 
                  gden( KAPFTS1.net ) ) , main='density') 
abline( v=gden( KAPFTS1.net ),col='red' )# the observed density for reference


triads <- triad.census( g.ergm.gwesp , mode='graph' )
obsetriad <- triad.census(KAPFTS1.net, mode='graph' )

hist(triads[,2] ,
     xlim= range( triads[,2] , obsetriad[2] ),
     main = 'single edge triads')
abline( v= obsetriad[2],col='red' )# the observed count for reference

hist(triads[,3] ,
     xlim= range( triads[,3] , obsetriad[3] ),
     main = 'open triads')
abline( v= obsetriad[3],col='red' )# the observed count for reference

hist(triads[,4] ,
     xlim= range( triads[,4] , obsetriad[4] ),
     main = 'closed triads')
abline( v= obsetriad[4],col='red' )# the observed count for reference

This two-parameter model seems to capture the local clustering well.

An overall cost for ties (edges) and a tendency for people who have shared friends to also be friends explains most of the network structure.

Conclusions

Based on these experiments:

  • Assuming that ties form independently is unrealistic
  • Independent tie-formation may create short pathways
  • Independence means that
    • the rich do not get richer (nodes do not know how many ties they already have)
    • closure does not happen (ties have ‘no memory’)
  • The degree distribution only explains some network features
    • knowing which nodes are popular does not explain whom they form ties to

Homework

Explore what other features the random graphs may or may not model

Pick a network, pick some ERGM terms by reference to ergm.terms (use the help function ?ergm.terms).

Try to estimate a model and then simulate from it. How did I get my parameter values? For Padgett, I got the coefficients:

# don't run
ans <- ergm(padgbus.net~edges+kstar(2) +kstar(3) + triangles)#
summary(ans)# this produces a table of coefficients

And for Kapferer:

# don't run
ans <- ergm(KAPFTS1.net~edges+gwesp(decay = log(2) , fixed= TRUE))#
summary(ans)# this produces a table of coefficients

Can you pick any statistic arbitrarily to include in the model?

There are at least three reasons the answer to this is no:

  • Higher order effects, like triangles, may be explained by the aggregation of lower-order statistics (such as two-stars)
  • Even if you want to model one specific statistic, it is not certain that you will be able to (see homework)
  • The Markov model and a model with GWESP can be represented in terms of principled dependence assumptions and other statistics do not have that interpretation

Ego-nets

For standard surveys, respondents are sampled independently of each other and we do not get any network data. The one piece of network information we can get from a survey is the number of contacts a respondent has - this is the degree of the respondent.

How much can we say about a network and a person’s network position based only on independent reports?

Survey data may provide egocentric, as opposed to sociocentric, network data. We call these ego-networks (Crossley et al. 2015).

In the US General Social Survey, respondents have been asked to nominate a range of different contacts. Here, we will

  • investigate the degree distributions of different relations,
  • examine to what extent we can explain the degree centrality using independent variables, and
  • examine the extent to which the degree centrality can explain other outcomes

The last sections (Fitting a distribution and Measurement error) provide brief introductions to more advanced topics.

Loading data

We are going to read in a Stata file, and for that we need the packages

library('foreign')
library('haven')# this package is needed to read Stata version 5-12 .dta file 
library('labelled')# this is needed to undo all the annoying things haven does

Download the General Social Survey data from the web first time around. Going forward you may want to save it locally. You can download data from https://gssdataexplorer.norc.org/ and read in the downloaded dataset (also the GSS for 1985) but here we will read the data straight from the url:

#GSSdata <- read.dta( 'https://osf.io/hn5b9/download' , convert.factors=FALSE )
GSSdata <- as.data.frame( read_dta( 'https://osf.io/npjv3/download' ), stringsAsFactors=FALSE )
val_labels(GSSdata) <- NULL

For read.dta, setting convert.factors=FALSE means we use numerical codes for categorical variables

Data matrix

The dataset is a

class(GSSdata)
## [1] "data.frame"

with dimensions

dim(GSSdata)
## [1] 2812 1180

constituting 2812 cases and 1180 variables. All variables are listed on http://www.thearda.com/Archive/Files/Codebooks/GSS2004_CB.asp

Basic demographic variables

Basic variables include

Number in list variable name name in GSSdata description
(3) WRKSTAT wrkstat employment status
(12) MARITAL marital marital status
(24) PAOCC80 PAOCC80 father’s occupation
(27) MAOCC80 MAOCC80 mother’s occupation
(39) DEGREE degree education
(43) SEX sex sex of respondent
(33) AGE age age of respondent

There are many measures of attitudes, for example political views (73 POLVIEW), attitudes towards government spending (74-101), race, racism, and ethnicity, religion, etc. In general, there are also a lot of questions about third parties, such as the household and household members. Many questions consist of a long battery of targeted attitudes, such as the ‘confidence in’ questions (163-176).

Psychological wellbeing

There is a variety of psychological wellbeing measures

Number in list variable name name in GSSdata description
(157) HAPPY happy A happiness measure
(160) LIFE life Excitement with Life
(459) MNTLHLTH mntlhlth self-assessed (negative) mental health
(960) HLTH2 HLTH2 whether the respondent underwent counseling for mental or emotional problems

Scales, subscales, and single items

While there is no specific documentation or listing of scales used in the survey, we recognise a number of questions as items from other scales.

Number in list Scale no. items reference
(161 - 162) Trust in People Scale 3 Yamagishi (1986)
(462-468) Empathic Concern subscale of interpersonal reactivity index 7 M. H. Davis et al. (1980)
(515-519) The Rosenberg Self-Esteem Scale 5 Rosenberg (2015)
(520-525) Subscale of Life orientation scale 6 Scheier, Carver, and Bridges (1994)
(524) OWNDOING, locus of control 1 Rotter (1966)
(804-805) Trust in People Scale 2 see e.g. Yamagishi (1986)

It is worth looking through http://www.thearda.com/Archive/Files/Codebooks/GSS2004_CB.asp searching for terms of interest.

Network summaries

The 2004 (and some earlier surveys) contain a number of network and network-related variables.

Variable name Description Note
partners no. sex partners past year listed as PARTNERS in help file. Vars 931-938 further breakdowns
numwomen no. female sex partners listed as NUWOMEN in help file
nummen no. male sex partners listed as NUMMEN in help file
PARTNRS5 no. sex partners in past 5 years binned
numgiven no. close contacts truncated at 6+.
numcnt no. people keep in contact with this is other than family and work
434-437 mode of contact keeping in contact with f-t-f, letters, email, etc

These are summaries of a (population) sexual network and a (population) close-contact network (a number of related network questions address different aspects of social interaction).

Note: The variable HAPPY and the Trust (161-162) and EC (462-468) scales are not asked of people that provided ‘numgiven’. It is therefore important to check cross-tabulations before doing an analysis; for example, tabulating happy against numgiven demonstrates that there are no respondents with values on both.

table(GSSdata$happy,GSSdata$numgiven,useNA='always')

There were 6 versions of the survey administered and not all of them included all questions, for example

table(GSSdata$version,GSSdata$happy, useNA='always')
table(GSSdata$version,GSSdata$numgiven, useNA='always')

Number of sex partners

The network of sexual contacts has received a lot of attention in public health, see for example Morris (2004). The number of sexual contacts also serves as a good illustration of modelling the degree distribution of a network from survey data.

The number of sex partners is partially reported in bins, and the missing-data code is 99

sexpart <- GSSdata$partners
sexpart[sexpart %in% c(9,989,99)] <- NA # missing data codes
sexpart[sexpart == 5 ] <- 7.5 # middle of range (4,10]
sexpart[sexpart == 6 ] <- 15 # middle of range (10,20]
sexpart[sexpart == 7 ] <- 60.5  # middle of range (20,100]
sexpart[sexpart == 8 ] <- 100 # 100+

The distribution is now

hist(sexpart)

And we can look at the difference between male and female respondents

boxplot(sexpart~ GSSdata$sex)

Testing the difference in the number of sex partners between men and women shows a clear difference

sexreg <- lm(sexpart~ GSSdata$sex==1)
summary(sexreg)
## 
## Call:
## lm(formula = sexpart ~ GSSdata$sex == 1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.989 -0.989 -0.083 -0.083 98.011 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.0831     0.1589   6.814 1.22e-11 ***
## GSSdata$sex == 1TRUE   0.9055     0.2344   3.863 0.000115 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.472 on 2192 degrees of freedom
##   (618 observations deleted due to missingness)
## Multiple R-squared:  0.006763,   Adjusted R-squared:  0.00631 
## F-statistic: 14.92 on 1 and 2192 DF,  p-value: 0.0001151

An equivalent way of testing this difference is by doing a t-test

t.test( sexpart~ GSSdata$sex==1, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  sexpart by GSSdata$sex == 1
## t = -3.8633, df = 2192, p-value = 0.0001151
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -1.3651153 -0.4458451
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            1.083122            1.988603

Note that the t-statistic is identical for the regression coefficient and the t-test.
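This equivalence is easy to check on simulated data (the variable names here are made up for the illustration):

```r
# Regressing on a binary group indicator and running a two-sample
# t-test with equal variances give the same t-statistic up to sign.
set.seed(42)
grp <- rep(c(0, 1), each = 50)
yy <- rnorm(100) + 0.5 * grp
t.from.lm <- summary(lm(yy ~ grp))$coefficients["grp", "t value"]
t.from.test <- unname(t.test(yy ~ factor(grp), var.equal = TRUE)$statistic)
all.equal(abs(t.from.lm), abs(t.from.test))  # TRUE
```

The sign differs only because the t-test subtracts the group means in the opposite order from the regression coding.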

If these are the degree distributions in an undirected network, how come that there are such differences? (see also Spiegelhalter 2015).

Because of this discrepancy and the binning, we will focus solely on the number of sexual partners of men in what follows.

Modelling partners

We are going to define \(Y_i\) as the number of sex partners of a male \(i\).

The variables \(Y_i\) are the out-degrees of men in the network of sexual partners

We will model \(Y_i\) using

  • Linear regression
  • Log-linear regression
  • Poisson regression
  • A power-law

Recode missing values for number of female sex partners and for the variable pre-marital sex

numwom <- GSSdata$nuwomen
numwom[numwom  > 993 ] <- NA # missing data codes
is.man <- GSSdata$sex==1
premarsx <- GSSdata$premarsx
premarsx[premarsx > 4] <- NA

OLS

We assume \[ Y_i = \beta_0 + \beta_1 x_{i} +\epsilon_i \] for \(x_{i}\) being attitude to pre-marital sex and \(\epsilon_i \overset{i.i.d}{\thicksim}N(0,\sigma^2)\).

The OLS estimates are given by

y <- numwom[ is.man ]# this for plotting reasons below
x <- premarsx[is.man ]# ditto
not.use <- is.na(y) | is.na(x)
y <- y[not.use==FALSE]
x <- x[not.use==FALSE]
nuwom.reg <- lm( y ~ x)
summary( nuwom.reg )
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47.21 -37.21 -22.21  -5.63 956.31 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -2.901     18.242  -0.159   0.8737  
## x             12.529      5.548   2.258   0.0246 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 121.6 on 337 degrees of freedom
## Multiple R-squared:  0.01491,    Adjusted R-squared:  0.01198 
## F-statistic:   5.1 on 1 and 337 DF,  p-value: 0.02457

Check the residuals to see if the normality assumption appears satisfied

par(mfrow=c(1,2))
hist(nuwom.reg$residuals)
plot(x=predict(nuwom.reg), y= y,
     xlab='Predicted Values',
     ylab='Actual Values',
     main='Predicted vs. Actual Values')
abline(a=0, b=1)

The residuals are clearly highly skewed and far from normal.

Log-normal

While the residuals are very skewed (and not normal), the residuals from a model for the logarithm of the outcome might look different. We cannot take the log of non-positive values, so some observations must be dropped or transformed first.

Outcome variable needs to be strictly positive

is.pos <- y>0

Letting \(Z_i=\log(Y_i)\), we estimate \[ Z_i = \alpha +\gamma X_{i} +\xi_i \]

log.nuwom.reg <- lm( log(y[is.pos]) ~ x[is.pos])
summary(log.nuwom.reg)
## 
## Call:
## lm(formula = log(y[is.pos]) ~ x[is.pos])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4972 -1.3986 -0.1946  1.0435  4.7534 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.08942    0.24140   4.513 9.04e-06 ***
## x[is.pos]    0.35195    0.07288   4.829 2.14e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.522 on 315 degrees of freedom
## Multiple R-squared:  0.06894,    Adjusted R-squared:  0.06598 
## F-statistic: 23.32 on 1 and 315 DF,  p-value: 2.141e-06
par(mfrow=c(1,2))
hist(log.nuwom.reg$residuals)
plot(x=predict(log.nuwom.reg), y= log(y[is.pos]),
     xlab='Predicted Values',
     ylab='Actual Values',
     main='Predicted vs. Actual Values')
abline(a=0, b=1)

Poisson regression

The original degrees \(Y_i\) are clearly counts, and the Poisson distribution may be a better model for counts \[ E(Y_i \mid x_i ) = \lambda_i,\;\Pr(Y_i=h\mid x_i)=e^{-\lambda_i}\frac{\lambda_i^h}{h!},{\text{ and }}\log(\lambda_i)=\eta_i=\beta_0+\beta_1x_i \]

We estimate \(\beta_0\) and \(\beta_1\) using glm

reg.po <- glm( y ~ x, family = poisson(link='log') )
summary( reg.po )
## 
## Call:
## glm(formula = y ~ x, family = poisson(link = "log"))
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -9.830  -7.187  -4.620  -1.379  70.675  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.00319    0.03905    51.3   <2e-16 ***
## x            0.46863    0.01065    44.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 38576  on 338  degrees of freedom
## Residual deviance: 36062  on 337  degrees of freedom
## AIC: 37354
## 
## Number of Fisher Scoring iterations: 7

Power-law

In addition to standard statistical distributions, a number of models for networks have been proposed that imply particular degree distributions. One such model is the mathematical mechanism of preferential attachment, which is claimed to give rise to a degree distribution that follows a power-law (Barabási and Albert 1999).

Liljeros et al. (2001) make very strong claims about the distribution of sex partners - the take-home point is: 'is degree the only determinant of connectivity and spread?'

\[ \Pr(Y=k) \propto k^{-\alpha} {\text{ , for }} k > k_c \]

They draw on epidemiological results suggesting that any disease spreading on the network becomes epidemic if \(\alpha \leq 3\).
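
For a given cut-off \(k_c\), the continuous-approximation maximum-likelihood estimate of the exponent has the closed form \(\hat{\alpha} = 1 + n_{tail}\big/\sum_{i} \log(k_i/k_c)\). A minimal sketch on simulated degrees (hypothetical data; the poweRlaw package used below does this properly, including discreteness corrections):

```r
# Continuous-approximation MLE of the power-law exponent above k_c
alpha.hat <- function(k, k_c) {
  tail.k <- k[k >= k_c]
  1 + length(tail.k) / sum(log(tail.k / k_c))
}
set.seed(42)
k <- ceiling(1 / runif(1000)^(1/2))  # crude heavy-tailed draws
alpha.hat(k, k_c = 2)
```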

In their paper, Barabási and Albert (1999) claim that the power-law form proves that preferential attachment is the only mechanism at play in networks. This is wrong: it is not the only mechanism, and there is no one-to-one correspondence between the power-law and preferential attachment (Robins G. L., Pattison, and Koskinen 2008).

Modelling the degree distribution

The package poweRlaw can be used to fit different functional forms to the degree distribution. To use it for the first time, install it.

install.packages('poweRlaw')

Load the package

library('poweRlaw')

The power-law does not straightforwardly allow explanatory variables, so let us simply re-estimate the log-normal and Poisson without predictors for comparison. Note that the power-law cannot handle zeros, so we now estimate the degree distributions conditional on \(Y_i>0\)

m_pl = displ$new( y[is.pos] ) # fit power-law
est = estimate_xmin(m_pl) # estimate the cut-off and exponent
m_pl$setXmin(est) # set the minimum value
m_ln = dislnorm$new( y[is.pos] ) # fit log-normal distribution
est_ln = estimate_xmin(m_ln) # estimate the minimum value
m_ln$setXmin(est_ln)
m_pois = dispois$new( y[is.pos] ) # fit Poisson
m_pois$setXmin(1)
est_pois = estimate_pars(m_pois)
m_pois$setPars(est_pois)

Note that estimate_xmin estimates the value of the degree \(k_c\) from which the power-law holds, and m_pl$setXmin stores it in the model object.

The package allows us to plot the fitted distributions, more specifically the 'survival functions' \(\Pr(Y > k)\), on a log-log scale.

plot(m_pl)# plots the raw data
lines(m_pl, col='red') # plots the power law
lines(m_ln, col='blue') # plots the lognormal
lines(m_pois, col='green') # plots the poisson

Which distribution seems to fit the number of female sex partners best? Two things to note: the Poisson has been forced to ignore zeros, and the power-law ignores all the information below \(k_c\)

Core discussion partners

Fischer (2009) and Bailey and Marsden (1999) and others (e.g. Marsden 2003; Putnam 2001; McPherson, Smith-Lovin, and Brashears 2006) have used various waves of the GSS to investigate ‘the size’ of peoples’ networks. The size of peoples’ networks has been the target of intense speculation since Robin Dunbar (1992) proposed that there was an evolutionary limit on the number of close contacts that humans (or primates?) can retain in their brain.

The key outcome variable in these studies has been the number of named core discussion partners

table( GSSdata$numgiven , useNA= 'always')
## 
##    0    1    2    3    4    5    6    9 <NA> 
##  397  281  263  232  128   96   70    5 1340

The value 9 is a missing-data code. Recall also that the number of named contacts is truncated at 6: a respondent who nominated 6 or more has numgiven coded as 6 but was only allowed to provide 5 names.

As for a sociocentric network, we can plot the (truncated) degree distribution.

GSSdata$numgiven[GSSdata$numgiven>6] <- NA
plot( table( GSSdata$numgiven ) )

Without knowing the structure of the population network, we can still estimate the degree distribution.
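
The estimated degree distribution is just the table of proportions. A sketch using a hypothetical vector of truncated degrees in place of GSSdata$numgiven:

```r
# Empirical degree distribution as estimated proportions
deg <- c(0, 0, 1, 2, 2, 3, 6, 1, 0, 4)  # hypothetical truncated degrees
deg.dist <- prop.table(table(deg))
plot(deg.dist, ylab = 'Estimated proportion', xlab = 'Number of contacts named')
```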

Explaining the degree distribution

As we did for sexual partners, we may try to explain the number of close contacts using statistical tests. While sexual partners and close contacts both have skewed degree distributions, there is one notable difference in the GSS, namely the range of values: close contacts range from 0 to 6, while sexual partners range from 0 to 900 or so. This has two consequences:

  • approximating close contacts with a continuous distribution is harder to justify, and
  • removing isolates (degree 0) removes an informative portion of the data.

Regressions

Close contacts might act as social support and alleviate mental health issues (Bryant et al. 2017). Can we test whether better mental health is associated with having more social support?

Create a binary variable indicating whether a respondent has experienced 1 or more days of mental ill health in the past year.

GSSdata$mntlhlth[ GSSdata$mntlhlth > 30 ] <- NA # set as missing
illhealth <- GSSdata$mntlhlth>0 # this will indicate one day or more

OLS

Again, assuming random normal errors

reg.close.lm <- lm(GSSdata$numgiven ~ illhealth )
summary( reg.close.lm )
## 
## Call:
## lm(formula = GSSdata$numgiven ~ illhealth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2925 -1.8109 -0.2925  1.1891  4.1891 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.81087    0.08684  20.854  < 2e-16 ***
## illhealthTRUE  0.48167    0.11679   4.124 4.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.786 on 944 degrees of freedom
##   (1866 observations deleted due to missingness)
## Multiple R-squared:  0.0177, Adjusted R-squared:  0.01666 
## F-statistic: 17.01 on 1 and 944 DF,  p-value: 4.047e-05

Poisson

Using the log-link function

reg.close.po <- glm(GSSdata$numgiven ~ illhealth, family = poisson(link='log') )
summary( reg.close.po )
## 
## Call:
## glm(formula = GSSdata$numgiven ~ illhealth, family = poisson(link = "log"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1413  -1.9031  -0.1976   0.8066   2.4489  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.59381    0.03613  16.435  < 2e-16 ***
## illhealthTRUE  0.23585    0.04625   5.099 3.42e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1750.4  on 945  degrees of freedom
## Residual deviance: 1724.0  on 944  degrees of freedom
##   (1866 observations deleted due to missingness)
## AIC: 3687.2
## 
## Number of Fisher Scoring iterations: 5
par(mfrow=c(1,2))
plot(x=predict(reg.close.lm), y= GSSdata$numgiven[!is.na(GSSdata$numgiven) & !is.na(illhealth)],
     xlab='Predicted Values',
     ylab='Actual Values',
     main='OLS Predicted vs. Actual Values')
abline(a=0, b=1)

plot(x=reg.close.po$fitted.values, y=  GSSdata$numgiven[!is.na(GSSdata$numgiven) & !is.na(illhealth)],
     xlab='Predicted Values',
     ylab='Actual Values',
     main='Po Predicted vs. Actual Values')
abline(a=0, b=1)

The predicted values are just \(\hat{\beta}_0\) and \(\hat{\beta}_0+\hat{\beta}_1\), and \(e^{\hat{\beta}_0}\) and \(e^{\hat{\beta}_0+\hat{\beta}_1}\), for the OLS and Poisson, respectively. Since the Poisson cannot predict negative values, its fit is better.
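
This can be verified directly: for a Poisson regression with a single binary predictor, the fitted values take exactly two values, \(e^{\hat{\beta}_0}\) and \(e^{\hat{\beta}_0+\hat{\beta}_1}\), which coincide with the two group means. A self-contained sketch with simulated counts (not the GSS variables):

```r
# Poisson fitted values for a binary predictor are exp(b0) and exp(b0 + b1),
# which equal the sample means of the two groups
set.seed(7)
grp <- rep(c(FALSE, TRUE), each = 100)
cnt <- rpois(200, lambda = ifelse(grp, 3, 2))
fit <- glm(cnt ~ grp, family = poisson(link = 'log'))
unique(round(fitted(fit), 6))     # two distinct fitted values
exp(coef(fit)[1])                 # equals mean(cnt[!grp])
exp(coef(fit)[1] + coef(fit)[2])  # equals mean(cnt[grp])
```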

par(mfrow=c(2,1))
plot(table(GSSdata$numgiven[illhealth==0]),ylab='',main='Degree distribution healthy')
abline(v=min(reg.close.po$fitted.values),col='red')
lines(mean(GSSdata$numgiven[illhealth==0],na.rm=TRUE),0,pch=15,type='p')
plot(table(GSSdata$numgiven[illhealth==1]),ylab='',main='Degree distribution long term un-healthy')
abline(v=max(reg.close.po$fitted.values),col='red')
lines(mean(GSSdata$numgiven[illhealth==1],na.rm=TRUE),0,pch=15,type='p')

Other tests

All the usual statistical tests are available in R, for example ANOVA: https://www.statmethods.net/stats/anova.html.

Ego-alter ties

In addition to naming alters, each ego has also provided some basic information about each of their named alters

Ego-alter tie

Here we will

  • transform data from the wide format to the so-called long format, so that
  • each Ego-Alter pair is represented by a separate case or data point

We will use the GSS 2004

Properties of ‘alters’

The number of named alters per person is given by the variable numgiven, which is the number of close contacts, truncated at 6+. For each of these named contacts, ego is providing descriptions in the form of

Variable name   Description                                          Note
SEX1            sex of first named alter                             NA if fewer than one alter named
\(\vdots\)
SEX5            sex of fifth named alter                             NA if fewer than five alters named
RACE1           race of first named alter                            NA if fewer than one alter named
\(\vdots\)
RACE5           race of fifth named alter                            NA if fewer than five alters named
TALKTO1         talks to first named alter about important matters   NA if fewer than one alter named
\(\vdots\)
TALKTO5         talks to fifth named alter about important matters   NA if fewer than five alters named
EDUC1           education of first named alter                       NA if fewer than one alter named
\(\vdots\)
EDUC5            education of fifth named alter                      NA if fewer than five alters named

There is also a host of questions about common group membership (e.g. GRPBOTH1 through GRPBOTH5), religion, etc.

Examples of named alters

Attributes for named alters are reported in the wide format: for each respondent values are reported in ascending order by alter id, variable by variable.

Case no nominated alters

Respondent has not named any alters, so there are no alter properties reported

GSSdata[1,261:270]

Case one nominated alter

Respondent has named exactly one alter, so there are only variable values for 'SEX1', 'RACE1', etc.

GSSdata[11,261:270]
GSSdata$SEX1[11]

Case three nominated alters

Respondent has named exactly three alters, so there are three reports for each of the variables ‘SEX1’, ‘RACE1’, etc.

GSSdata[13,261:270]

These 3 alters are all female

GSSdata[13,271:275]

Case four nominated alters

Respondent has named exactly four alters, so there are four alter values to report on each variable

GSSdata[16,271:275]
GSSdata[16,301:305]
GSSdata[16,306:310]

Homophily

For egos 13, 16 and 42, the age of their alters is given in

GSSdata[c(13,16,42),c(346:350)]

we can thus get the average age in each egonet using

GSSdata$numgiven[GSSdata$numgiven>6] <- NA
ave.age <- rowSums(GSSdata[,c(346:350)], na.rm=TRUE)/GSSdata$numgiven 
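
Note that rowSums(..., na.rm = TRUE)/numgiven implicitly treats a missing alter age as zero, which biases the average downwards; rowMeans(..., na.rm = TRUE) averages over the ages actually reported. A sketch on a hypothetical two-ego age matrix:

```r
# rowMeans(na.rm = TRUE) averages over reported ages only
ages <- rbind(c(30, 40, NA, NA, NA),  # named 3 alters, one age missing
              c(20, 22, 24, NA, NA))  # named 3 alters, all ages reported
numgiven <- c(3, 3)
rowSums(ages, na.rm = TRUE) / numgiven  # 23.33 and 22: first is biased low
rowMeans(ages, na.rm = TRUE)            # 35 and 22
```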

Compare the age of ego with

plot( GSSdata$age[GSSdata$numgiven %in% c(1:5)] ,  ave.age[GSSdata$numgiven %in% c(1:5)])

Homophily (McPherson, Smith-Lovin, and Cook 2001) is the process of people choosing contacts that are like themselves. If there is homophily on age, we would expect the average age of alters to be predicted by the age of ego

reg.age <- lm( ave.age[GSSdata$numgiven %in% c(1:5)] ~ GSSdata$age[GSSdata$numgiven %in% c(1:5)] )
summary(reg.age)
## 
## Call:
## lm(formula = ave.age[GSSdata$numgiven %in% c(1:5)] ~ GSSdata$age[GSSdata$numgiven %in% 
##     c(1:5)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.966  -7.673  -0.745   5.726  60.715 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               23.91781    1.06635   22.43   <2e-16
## GSSdata$age[GSSdata$numgiven %in% c(1:5)]  0.51310    0.02212   23.19   <2e-16
##                                              
## (Intercept)                               ***
## GSSdata$age[GSSdata$numgiven %in% c(1:5)] ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.59 on 998 degrees of freedom
## Multiple R-squared:  0.3502, Adjusted R-squared:  0.3496 
## F-statistic: 537.9 on 1 and 998 DF,  p-value: < 2.2e-16

We can also investigate if there is homophily on race. Race is coded differently for respondents and alters so we need to recode

ego.white <- GSSdata$race==1
alter.white <- GSSdata[,c(276,  277 , 278 , 279 , 280)]==4
prop.alter.white <- rowSums(alter.white,  na.rm=TRUE)/GSSdata$numgiven
boxplot(prop.alter.white ~ ego.white )

There is clear evidence of homophily on race.

Diversity

In terms of access to resources, it may be important to have access to different types of resources. For each ego we can tabulate the number of alters in each category

alter.education <- cbind(
  rowSums(GSSdata[,c(341:345)]==0, na.rm=TRUE),
  rowSums(GSSdata[,c(341:345)]==1, na.rm=TRUE),
  rowSums(GSSdata[,c(341:345)]==2, na.rm=TRUE),
  rowSums(GSSdata[,c(341:345)]==3, na.rm=TRUE),
  rowSums(GSSdata[,c(341:345)]==4, na.rm=TRUE),
  rowSums(GSSdata[,c(341:345)]==5, na.rm=TRUE),
  rowSums(GSSdata[,c(341:345)]==6, na.rm=TRUE),
  rowSums(GSSdata[,c(341:345)]==7, na.rm=TRUE),
  rowSums(GSSdata[,c(341:345)]==8, na.rm=TRUE))

Based on this we can calculate the proportions \(p_j\) of alters in category \(j\).

alter.education <- alter.education / rowSums(alter.education)# convert to proportions

Simpson's index \[ H = \sum_{j=1}^R p_j^2 \] (Simpson 1949; see the discussion of similarity indices for networks in Stys et al. 2020)

diss.index <- rowSums(alter.education^2,na.rm=TRUE)
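
As a sanity check on the index (hypothetical proportions): an ego whose alters are all in one category has \(H=1\), while an ego whose alters are spread evenly over five categories has \(H=1/5\).

```r
# Simpson's index sum(p_j^2): 1 = fully homogeneous, 1/R = maximally diverse
p.homog  <- c(1, 0, 0, 0, 0)  # all alters in one category
p.spread <- rep(1/5, 5)       # alters spread evenly over five categories
sum(p.homog^2)   # 1
sum(p.spread^2)  # 0.2
```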

Converting to long format

Long format means that each ego-alter pair is coded as a separate observation. Manually, we can create a new data frame where the alter id is indicated by the suffix '_1', '_2', etc.

d1 <- data.frame(id = GSSdata$id,
                 egosex = GSSdata$sex,
                 egoage = GSSdata$age,
                 sex_1 = GSSdata$SEX1,
                 sex_2 = GSSdata$SEX2,
                 sex_3 = GSSdata$SEX3,
                 sex_4 = GSSdata$SEX4,
                 sex_5 = GSSdata$SEX5,
                 talkto_1 = GSSdata$TALKTO1,
                 talkto_2 = GSSdata$TALKTO2,
                 talkto_3 = GSSdata$TALKTO3,
                 talkto_4 = GSSdata$TALKTO4,
                 talkto_5 = GSSdata$TALKTO5,
                 age_1 = GSSdata$AGE1,
                 age_2 = GSSdata$AGE2,
                 age_3 = GSSdata$AGE3,
                 age_4 = GSSdata$AGE4,
                 age_5 = GSSdata$AGE5,
                 cowork_1 = GSSdata$COWORK1,
                 cowork_2 = GSSdata$COWORK2,
                 cowork_3 = GSSdata$COWORK3,
                 cowork_4 = GSSdata$COWORK4,
                 cowork_5 = GSSdata$COWORK5
                 )

This is still wide-format, but it is a data frame that only has the subset of variables that we chose. To convert it to ‘long format’, we use the reshape function

egoalter <- reshape(d1, # our wide-format data frame
                    dir = "long", # transform into 'long' format
                    varying = 4:23, # variables 4 through 23 vary across alters
                    sep = "_") # the character separating variable name from alter id

# This reshape function does not work on 'haven'-formatted data with values

Have a look at the new, long-format data frame

head(egoalter)
##        id egosex egoage time sex talkto age cowork
## 502.1 502      2     30    1  NA     NA  NA     NA
## 578.1 578      1     32    1  NA     NA  NA     NA
## 600.1 600      2     41    1  NA     NA  NA     NA
## 863.1 863      2     39    1  NA     NA  NA     NA
## 951.1 951      1     61    1  NA     NA  NA     NA
## 991.1 991      1     35    1  NA     NA  NA     NA

Recode talkto to a binary variable capturing ‘almost daily’

egoalter$talkto[egoalter$talkto > 4] <- NA
egoalter$talkto[egoalter$talkto>1] <- 0
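
The order of the two recodes matters: the missing-data codes (values above 4) must be set to NA before the remaining values above 1 are collapsed to 0, otherwise they would be counted as valid zeros. A self-contained check of the recode logic on hypothetical raw codes:

```r
# Recode sketch: 1 -> 1 ('almost daily'), 2-4 -> 0, above 4 -> NA
talkto <- c(1, 2, 3, 4, 8, 9, 1)  # hypothetical raw codes
talkto[talkto > 4] <- NA          # missing-data codes first
talkto[talkto > 1] <- 0           # then collapse less frequent contact
talkto                            # 1 0 0 0 NA NA 1
```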

Multilevel model of ego-alter ties

Consider now modelling 'talk to almost daily' as a binary variable \(Y_{ij}\), where \(i\) is ego and \(j\) is alter, \[ \mathrm{logit} (\Pr(Y_{ij}=1\mid z_i,x_j))=\alpha+\beta z_i +\gamma x_j + \nu_i, \] where \(z_i\) are ego covariates, \(x_j\) are alter covariates, and \(\nu_i \thicksim N(0,\tau^2)\) is a random effect for ego \(i\) that is shared across all alters of \(i\).

To do this using a logistic link function we need the package lme4

library(lme4)
 reg.talkto <- glmer(talkto ~ egoage+egosex+sex+age+cowork+(1| id), # regression formula
                     data= egoalter, # name of the data frame that holds the variables
                     family= binomial)
summary(  reg.talkto )
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: talkto ~ egoage + egosex + sex + age + cowork + (1 | id)
##    Data: egoalter
## 
##      AIC      BIC   logLik deviance df.resid 
##   3759.1   3800.7  -1872.5   3745.1     2829 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.0978 -0.8585  0.5168  0.7883  1.7916 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  id     (Intercept) 0.6548   0.8092  
## Number of obs: 2836, groups:  id, 1064
## 
## Fixed effects:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.783526   0.350290   5.092 3.55e-07 ***
## egoage      -0.015355   0.003378  -4.545 5.50e-06 ***
## egosex       0.358715   0.101995   3.517 0.000436 ***
## sex          0.211911   0.088154   2.404 0.016223 *  
## age         -0.009880   0.003076  -3.212 0.001318 ** 
## cowork      -0.692913   0.136942  -5.060 4.19e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##        (Intr) egoage egosex sex    age   
## egoage -0.299                            
## egosex -0.357  0.047                     
## sex    -0.274  0.002 -0.125              
## age    -0.194 -0.424 -0.022 -0.015       
## cowork -0.668  0.007 -0.086 -0.068 -0.024
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00246418 (tol = 0.002, component 1)
## Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?

Network regression

Load data

We are looking at the s50 dataset, which is further described here: https://www.stats.ox.ac.uk/~snijders/siena/s50_data.htm

This dataset is available in zipped format online.

temp <- tempfile()
download.file("https://www.stats.ox.ac.uk/~snijders/siena/s50_data.zip",temp)
X <- as.matrix( read.table(unz(temp, "s50-network1.dat")) )
sport <- read.table(unz(temp, "s50-sport.dat"))
smoke <- read.table(unz(temp, "s50-smoke.dat"))
alcohol <- read.table(unz(temp, "s50-alcohol.dat"))
unlink(temp)
n <- nrow(X)
smoke <- (smoke[,2] %in% c(2,3))+0 # use wave 2; code values 2 and 3 as smoking (1) and value 1 as non-smoking (0)
sport <- sport[,1]-1 # dichotomise sporty

Overall task

The variable Alcohol use is coded as

Alcohol value   Meaning
1               none
2               once or twice a year
3               once a month
4               once a week
5               more than once a week

We are now going to investigate whether your friends' alcohol use tends to influence your own. Start by plotting the network with

  • Node size proportional to alcohol use
  • Colour by smoker (2-3) and non-smoker (1)

gplot(X,vertex.cex=alcohol[,2]/2, vertex.col = ifelse(smoke==1,'red','green') )

Looking at the figure, do you see any evidence of social influence?

Big nodes seem to be tied to other big nodes and small nodes to other small nodes. Smokers also seem to hang out with other smokers.

Ordinary least squares (OLS) regression

Make the assumption that the levels of alcohol use can be treated as an interval-level, continuous variable. To model the outcomes, we start by assuming that outcomes are independent across students, using a regression model

\[ Y_i = \beta_0+\beta_1 m_{i,smoke} + \beta_2 m_{i,sport}+\epsilon_i \]

where

  • \(\beta_0\) is the intercept
  • \(\beta_1\) is the average difference in alcohol use for smokers relative to non-smokers
  • \(\beta_2\) is the average difference in alcohol use for sporty people relative to non-sporty people

and the \(\epsilon_i\)’s are assumed to be independent across \(i\) and follow a normal distribution \(N(0,\sigma^2)\).

Fit OLS

y <- alcohol[,2]
ans.ols <- lm(y ~ smoke+sport)
summary(ans.ols)
## 
## Call:
## lm(formula = y ~ smoke + sport)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0922 -0.6918  0.1690  0.8232  2.5694 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.4306     0.3169   7.669 7.98e-10 ***
## smoke         1.4004     0.3084   4.541 3.90e-05 ***
## sport         0.2612     0.3331   0.784    0.437    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.021 on 47 degrees of freedom
## Multiple R-squared:  0.305,  Adjusted R-squared:  0.2755 
## F-statistic: 10.31 on 2 and 47 DF,  p-value: 0.0001933

Question

What conclusions can you draw from the ANOVA table of the OLS regression regarding the regressors smoke and sport?

Smokers score on average 1.4 points higher on drinking, and the coefficient is significantly different from 0 at the 1% level (and beyond). There is no evidence that sporty people drink more or less than non-sporty people, as that coefficient is not significantly different from 0.

Residuals

We assumed that the errors were normally distributed so the residuals

\[ \hat{e}_i= y_i - \hat{y}_i= y_i - M_i\hat{\beta} \]

should be normally distributed:

hist(ans.ols$residuals)

The residuals seem reasonably ‘normal’
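
The residual formula can be checked against what lm stores, using the design matrix from model.matrix. A self-contained sketch with simulated data (not the s50 variables):

```r
# Residuals computed 'by hand' as y - M %*% beta-hat agree with lm's
set.seed(3)
x1 <- rbinom(50, 1, 0.4)
x2 <- rbinom(50, 1, 0.5)
yy <- 2 + 1.5 * x1 + rnorm(50)
fit <- lm(yy ~ x1 + x2)
Mm <- model.matrix(fit)                    # n x (p + 1) design matrix
e.hat <- yy - as.vector(Mm %*% coef(fit))  # y_i - M_i beta-hat
all.equal(e.hat, unname(residuals(fit)))   # TRUE
```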

Autocorrelation

We are now going to investigate other properties of the model.

For the sna package, we need to put all covariates, including the constant, in a matrix

\[ M = \begin{pmatrix} M_{1}\\ M_{2}\\ \vdots\\ M_{n}\\ \end{pmatrix}= \begin{pmatrix} 1 & m_{1,1} & \cdots & m_{1,p}\\ 1 & m_{2,1} & \cdots & m_{2,p}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & m_{n,1} & \cdots & m_{n,p}\\ \end{pmatrix} \]

We thus put all covariates into the same matrix:

M <- cbind(matrix(1,n,1),smoke,sport)
colnames(M) <- c('cons','smoke','sport')
head(M)
##      cons smoke sport
## [1,]    1     0     1
## [2,]    1     1     0
## [3,]    1     0     0
## [4,]    1     0     1
## [5,]    1     0     1
## [6,]    1     0     1

Let us investigate whether the residuals are independent across the network. In particular, if the outcome of \(i\) is completely independent of the outcome of \(j\), then we would not expect a difference between \((\hat{e}_i - \bar{e})(\hat{e}_j-\bar{e})\) for a pair connected through a tie, \(x_{ij}=1\), and \((\hat{e}_i - \bar{e})(\hat{e}_k-\bar{e})\) for a pair that is not connected, \(x_{ik}=0\). In other words, higher (lower) than predicted values should not be expected just because a network partner has a higher (lower) than predicted value. Note that the average \(\bar{e}=0\) by construction and is included only for clarity. Moran's I measures exactly whether connected partners tend to have more similar residuals than unconnected people. In terms of the outcome variable

\[ I_k =\frac{n}{\sum_{i=1}^n\sum_{j=1}^n X_{ij}^{(k)}} \frac{\sum_{i=1}^n \sum_{j=1}^n (y_i-\bar{y}) (y_j-\bar{y})X_{ij}^{(k)} }{\sum_{j=1}^n (y_j-\bar{y})^2} \]

Where \(X_{ij}^{(k)}\) is the \(k\)-step adjacency matrix, i.e. \(X_{ij}^{(1)}\) is the matrix of people that are directly connected, \(X_{ij}^{(2)}\) is the matrix of people that are connected at distance 2, etc. The step \(k\) is sometimes called lag. Intuitively, if \(X_{ij}^{(k)}\) doesn’t matter, then many cross-product terms should cancel others out. It can be shown that, under the assumption that there is no network correlation at lag \(k=1\), then the expected value is

\[ E(I_1) = \frac{-1}{N-1} \]

Here we want to check if this seems to hold true for our residuals - are they uncorrelated across network partners?

nacf(X,ans.ols$residuals,type="moran")
##           0           1           2           3           4           5 
##  1.00000000  0.19495562 -0.17048356  0.02784058 -0.04684902 -0.01678792 
##           6           7           8           9          10          11 
##  0.03002702 -0.06753150 -0.10455140  0.00000000  0.00000000  0.00000000 
##          12          13          14          15          16          17 
##  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000 
##          18          19          20          21          22          23 
##  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000 
##          24          25          26          27          28          29 
##  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000 
##          30          31          32          33          34          35 
##  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000 
##          36          37          38          39          40          41 
##  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000 
##          42          43          44          45          46          47 
##  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000  0.00000000 
##          48          49 
##  0.00000000  0.00000000
# nacf(X,ans.ols$residuals,type="geary") # Geary's C is another measure of autocorrelation but is harder to understand

That \(I_0=1\) is because this is the correlation of residuals with themselves!
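
The lag-1 value can also be computed directly from the formula. A self-contained sketch on a small hypothetical symmetric network (nacf's internal handling of the directed s50 matrix may differ in detail):

```r
# Moran's I at lag 1, straight from the definition
moran.1 <- function(A, e) {
  e <- e - mean(e)
  n <- length(e)
  (n / sum(A)) * as.numeric(t(e) %*% A %*% e) / sum(e^2)
}
A <- matrix(0, 4, 4)  # tiny hypothetical network with ties 1-2 and 2-3
A[1, 2] <- A[2, 1] <- A[2, 3] <- A[3, 2] <- 1
e <- c(1.2, 0.8, -0.5, -1.5)  # stand-in residuals
moran.1(A, e)
```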

We are only really interested in shorter lags, so let us plot the first 4 and add a reference line for \(E(I_1)\) under no correlation

plot(nacf(X,ans.ols$residuals,type="moran")[1:4],type='b')
abline(h=-1/(n-1),col='red')

Row-normalise

When modelling social influence we may want to scale the network ties to take into account how many ties people have. Leenders (2002) proposes a number of ways in which the adjacency matrix can be scaled. Here we want each arc to have the weight

\[ W_{ij} = X_{ij}/d_{i,out} \]

We can do this easily, BUT we have to be careful not to divide by 0 (we define \(0/0\) as \(0\))

degrees <- degree(X, cmode="outdegree")
W <- X
W[degrees>0,] <- X[degrees>0,]/degrees[degrees>0]

# You can check that we are actually normalising correctly
# Wcopy <- X
# for (i in c(1:n))
# {
#  if (degrees[i]>0)
#  {
#    Wcopy[i,] <- X[i,]/degrees[i]
#  }
#  
#}
#sum( W != Wcopy)

Check residuals again

Check the residuals again

plot(nacf(W,ans.ols$residuals,type="moran")[1:4],type='b')

Do we see any difference? In the formula for \(I_k\) the scaling factors \(d_i\) will actually cancel out.

Network autocorrelation model

The network autocorrelation model, or network disturbances model, looks exactly like the OLS \[ y_i = M_i \beta + \epsilon_i\text{,} \]

but we no longer assume that the residuals are independent. Instead, we induce network autocorrelation in the error terms

\[ \epsilon_i = \rho \sum_{j=1}^n W_{ij}\epsilon_j+\xi_i \]

(\(\xi\) is the Greek letter Xi - https://en.wikipedia.org/wiki/Xi_(letter)). The error terms \(\xi_i\) are assumed to be independent and identically distributed, \(\xi_i \thicksim N(0,\sigma^2)\). The interpretation is that if \(i\) has a higher than predicted value on the outcome variable, then \(j\) is more likely to also have a higher than predicted value, for all \(j\) that \(i\) has nominated.

If you know a bit of matrix algebra, you will notice that we can write the vector of disturbances \(\epsilon = (\epsilon_1,\ldots,\epsilon_n)^{\top}\) as

\[ \epsilon = \rho W \epsilon + \xi \]

One immediate issue here is that we have \(\epsilon\) on both sides of the equation. We can simplify this expression by solving for \(\epsilon\)

\[ \epsilon = (I-\rho W)^{-1}\xi \]

You can interpret this as the error terms \(\xi_i\) ‘spreading’ on the network.
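
When \(\rho\) is small enough, \((I-\rho W)^{-1} = I + \rho W + \rho^2 W^2 + \cdots\), which makes the spreading interpretation concrete: a shock reaches network neighbours at distance \(k\) with weight of order \(\rho^k\). A sketch on a small hypothetical row-normalised matrix:

```r
# (I - rho W)^{-1} matches the Neumann series I + rho W + rho^2 W^2 + ...
rho <- 0.3
W <- matrix(c(0,   1, 0,
              0.5, 0, 0.5,
              0,   1, 0), 3, 3, byrow = TRUE)  # row-normalised toy network
inv <- solve(diag(3) - rho * W)
series <- diag(3) + rho * W + rho^2 * (W %*% W) + rho^3 * (W %*% W %*% W)
max(abs(inv - series))  # already below 0.01 after three terms
```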

Fit network autocorrelation model

The function lnam (which presumably stands for linear network autocorrelation model) can fit this model. The formula is specified in terms of the outcome variable y and the matrix x of covariates. The weight matrix is specified as W2.

netacm.1 <-lnam(y=y, x=M, W2=X)
## Loading required namespace: numDeriv
summary(netacm.1)
## 
## Call:
## lnam(y = y, x = M, W2 = X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6676 -0.6647  0.3353  1.1311  2.5269 
## 
## Coefficients:
##        Estimate Std. Error Z value Pr(>|z|)    
## cons    2.47307    0.33278   7.431 1.07e-13 ***
## smoke   1.00287    0.35437   2.830  0.00465 ** 
## sport   0.19162    0.30839   0.621  0.53438    
## rho2.1  0.16432    0.07098   2.315  0.02061 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##       Estimate Std. Error
## Sigma   0.9279      0.009
## 
## Goodness-of-Fit:
##  Residual standard error: 1.061 on 46 degrees of freedom (w/o Sigma)
##  Multiple R-Squared: 0.1757, Adjusted R-Squared: 0.104
##  Model log likelihood: -68.42 on 45 degrees of freedom (w/Sigma)
##  AIC: 146.8 BIC: 156.4 
## 
##  Null model: meanstd 
##  Null log likelihood: -79.54 on 48 degrees of freedom
##  AIC: 163.1 BIC: 166.9 
##  AIC difference (model versus null): 16.24 
##  Heuristic Log Bayes Factor (model versus null):  10.51

Did the conclusions about the regressors smoke and sport change?

The difference between smokers and non-smokers remains, but the effect is somewhat smaller.

Is there evidence of network autocorrelation?

The network autocorrelation parameter \(\rho\) is reported as rho2.1. It is estimated at 0.16 and is statistically significantly different from zero at the 5% level (p = 0.021).

Looking at the network plot, it almost looks as if high-degree nodes drink more. To take this into account, we add outdegree as an explanatory variable:

M <- cbind(matrix(1, n, 1), degrees, smoke, sport)
colnames(M) <- c('cons', 'degree', 'smoke', 'sport')
head(M)
##      cons degree smoke sport
## [1,]    1      3     0     1
## [2,]    1      4     1     0
## [3,]    1      4     0     0
## [4,]    1      4     0     1
## [5,]    1      2     0     1
## [6,]    1      2     0     1

Now re-run the network autocorrelation model (call the output netacm.2 to align with the code in the code chunk below)

netacm.2 <- lnam(y = y, x = M, W2 = W)
summary(netacm.2)
## 
## Call:
## lnam(y = y, x = M, W2 = W)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91118 -0.71108  0.09031  0.60982  2.34177 
## 
## Coefficients:
##        Estimate Std. Error Z value Pr(>|z|)    
## cons    2.12466    0.39409   5.391    7e-08 ***
## degree  0.08893    0.05627   1.580 0.113994    
## smoke   1.07210    0.31516   3.402 0.000669 ***
## sport   0.25294    0.29605   0.854 0.392900    
## rho2.1  0.66227    0.29190   2.269 0.023278 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##       Estimate Std. Error
## Sigma   0.9115      0.009
## 
## Goodness-of-Fit:
##  Residual standard error: 1.032 on 45 degrees of freedom (w/o Sigma)
##  Multiple R-Squared: 0.2479, Adjusted R-Squared: 0.1643
##  Model log likelihood: -67.27 on 44 degrees of freedom (w/Sigma)
##  AIC: 146.5 BIC: 158 
## 
##  Null model: meanstd 
##  Null log likelihood: -79.54 on 48 degrees of freedom
##  AIC: 163.1 BIC: 166.9 
##  AIC difference (model versus null): 16.55 
##  Heuristic Log Bayes Factor (model versus null):  8.902

Does accounting for outdegree change the conclusions about network autocorrelation?

No, the coefficient for degree itself is not statistically significantly different from 0. The magnitude of the network autocorrelation parameter, however, increases substantially.

Network Effects model

The network effects model assumes that there is correlation through the outcome variable \(Y\). This is subtly different from correlation only through the error term. The outcome \(Y_i\) for \(i\) depends on a weighted sum of the values \(Y_j\) of the contacts of \(i\)

\[ Y_i = \rho \sum_{j=1}^n W_{ij}Y_j + M_i\beta+\epsilon_i \]

where \(\epsilon_i \thicksim N(0,\sigma^2)\) independently for all \(i\).

If you want to model social influence or social contagion, the network effects model is more appropriate than the autocorrelation model, as the former models the dependence between the values on the outcome variable itself.

The equation for the full outcome vector can, as before, be written compactly as

\[ Y = \rho W Y + M\beta+\epsilon \]

Here, there is feedback through the outcome variable itself.

Stationary distribution

We can derive this model as the limit of a longitudinal model

\[ Y_{i,t} = \rho \sum_{j=1}^n W_{ij}Y_{j,t-1}+M_i\beta+\epsilon_{i,t} \]

as \(t\) tends to infinity.
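One way to see this (a sketch, assuming convergence): unrolling the longitudinal equation in matrix form gives

\[ Y_t = (\rho W)^t Y_0 + \sum_{k=0}^{t-1} (\rho W)^k \left( M\beta + \epsilon_{t-k} \right) \]

and the first term vanishes as \(t\) tends to infinity provided \(|\rho|\) times the largest eigenvalue of \(W\) is less than one (for a row-normalised \(W\), \(|\rho|<1\)). The limiting process then satisfies the network effects equation.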

For a quick illustration, let us look at the first five iterations of the updating according to the longitudinal form of the model

rho <- 0.4 # set a strong influence parameter
beta <- matrix(summary(netacm.2)$beta[,1], 4, 1) # use the estimated coefficients from the last regression
Y <- matrix(rnorm(n), n, 1) # random starting point
par(mfrow = c(2,3), oma = c(0,1,0,0) + 0.1, mar = c(1,0,1,1) + 0.1)
coord <- gplot(X, gmode = 'graph', vertex.cex = Y/2, main = paste('M ', round(nacf(W, Y[,1], type = "moran")[2], 3)))
for (k in c(1:5)){
  Y <- rho * W %*% Y + M %*% beta + matrix(rnorm(n), n, 1) # update according to the equation; %*% is matrix multiplication
  gplot(X, gmode = 'graph', vertex.cex = Y/2, coord = coord, main = paste('M ', round(nacf(W, Y[,1], type = "moran")[2], 3)))
}

We may or may not see increased correlation in the outcomes. Running 100 iterations makes this a little more evident

moran <- matrix(0, 100, 3)
rho <- 0.4
Y <- matrix(rnorm(n), n, 1)
for (k in c(1:100)){
  Y <- rho * W %*% Y + M %*% beta + matrix(rnorm(n), n, 1)
  moran[k,] <- nacf(W, Y[,1], type = "moran")[2:4]
}
par(mfrow = c(1,1))
plot(moran[,1], type = 'l') # Moran's I at lag 1
lines(moran[,2], col = 'red') # lag 2
lines(moran[,3], col = 'blue') # lag 3
abline(h = -1/(n-1), col = 'grey') # expected Moran's I under no autocorrelation

We might see a slight upward trend, and the correlations are generally positive.
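As a quick sanity check of the connection between the recursion and its fixed point, we can iterate a noiseless version on a hypothetical 3-node chain (toy values Wt, rho.t, cvec, not the school data) and compare with the matrix solution \((I-\rho W)^{-1}c\):

```r
# Hypothetical 3-node chain (not the school data): iterating the noiseless
# recursion Yt <- rho.t * Wt %*% Yt + cvec should converge to the fixed point
# solve(I - rho.t * Wt) %*% cvec
Wt <- matrix(c(0,1,0, 1,0,1, 0,1,0), 3, 3)
Wt <- Wt / rowSums(Wt)          # row-normalise
rho.t <- 0.4                    # |rho.t| < 1, so the iteration contracts
cvec <- c(1, 2, 3)
Yt <- rep(0, 3)
for (k in 1:200) Yt <- rho.t * Wt %*% Yt + cvec
Yt.closed <- solve(diag(3) - rho.t * Wt) %*% cvec
all.equal(c(Yt), c(Yt.closed))  # TRUE
```

The same logic applies to the full model; with the noise terms included the equality holds in distribution rather than exactly.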

As before, \(Y\) appears on both sides of the equation, but we can solve for \(Y\)

\[ Y = (I- \rho W)^{-1}(M\beta+\epsilon) \]
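Taking expectations in this solved form (recall that \(\epsilon\) has mean zero) gives

\[ E[Y] = (I-\rho W)^{-1} M\beta \]

so in the network effects model the covariate effects also propagate through the network: changing a covariate for one node shifts the expected outcomes of all nodes reachable from it.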

The same function as before, lnam, is used to fit this model. The difference from the network autocorrelation model is that the weight matrix is specified as W1, not W2.

neteff.2 <- lnam(y = y, x = M, W1 = W)
summary(neteff.2)
## 
## Call:
## lnam(y = y, x = M, W1 = W)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.04530 -0.70707  0.08996  0.74132  2.41599 
## 
## Coefficients:
##        Estimate Std. Error Z value Pr(>|z|)    
## cons    1.78401    0.42224   4.225 2.39e-05 ***
## degree  0.07223    0.05714   1.264 0.206154    
## smoke   1.14399    0.31259   3.660 0.000253 ***
## sport   0.19588    0.31403   0.624 0.532785    
## rho1.1  0.29178    0.17255   1.691 0.090838 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##       Estimate Std. Error
## Sigma   0.9404      0.009
## 
## Goodness-of-Fit:
##  Residual standard error: 1.025 on 45 degrees of freedom (w/o Sigma)
##  Multiple R-Squared: 0.3229, Adjusted R-Squared: 0.2477
##  Model log likelihood: -68.05 on 44 degrees of freedom (w/Sigma)
##  AIC: 148.1 BIC: 159.6 
## 
##  Null model: meanstd 
##  Null log likelihood: -79.54 on 48 degrees of freedom
##  AIC: 163.1 BIC: 166.9 
##  AIC difference (model versus null): 14.98 
##  Heuristic Log Bayes Factor (model versus null):  7.336

There is no evidence for social influence on alcohol use at the 5% level (the p-value for rho1.1 is 0.09).

Simplify model

Fit a regression that only includes the intercept and smoke, and that uses the original, non-normalised adjacency matrix

neteff.2 <- lnam(y = y, x = M[, c(1, 3)], W1 = X)
summary(neteff.2)
## 
## Call:
## lnam(y = y, x = M[, c(1, 3)], W1 = X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91250 -0.60102 -0.06224  0.75113  2.24134 
## 
## Coefficients:
##        Estimate Std. Error Z value Pr(>|z|)    
## cons    2.21840    0.25578   8.673  < 2e-16 ***
## smoke   1.10413    0.30849   3.579 0.000345 ***
## rho1.1  0.06896    0.03219   2.142 0.032153 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##       Estimate Std. Error
## Sigma   0.9503      0.009
## 
## Goodness-of-Fit:
##  Residual standard error: 1.013 on 47 degrees of freedom (w/o Sigma)
##  Multiple R-Squared: 0.3161, Adjusted R-Squared: 0.2724
##  Model log likelihood: -68.59 on 46 degrees of freedom (w/Sigma)
##  AIC: 145.2 BIC: 152.8 
## 
##  Null model: meanstd 
##  Null log likelihood: -79.54 on 48 degrees of freedom
##  AIC: 163.1 BIC: 166.9 
##  AIC difference (model versus null): 17.9 
##  Heuristic Log Bayes Factor (model versus null):  14.07

Question

What do you conclude from the results? Can you relate this to total and average exposure in the diffusion model?

When the matrix is not row-normalised, there is evidence of social influence. This suggests that it is the number of people, not the proportion of friends, that exerts influence on you (total exposure rather than average exposure).

References

Bailey, Stefanie, and Peter V Marsden. 1999. “Interpretation and Interview Context: Examining the General Social Survey Name Generator Using Cognitive Methods.” Social Networks 21 (3): 287–309.
Barabási, Albert-László, and Réka Albert. 1999. “Emergence of Scaling in Random Networks.” Science 286 (5439): 509–12.
Borgatti, Stephen P, and Martin G Everett. 2006. “A Graph-Theoretic Perspective on Centrality.” Social Networks 28 (4): 466–84.
Borgatti, Stephen P, Martin G Everett, and Jeffrey C Johnson. 2018. Analyzing Social Networks. Sage.
Bryant, Richard A, H Colin Gallagher, Lisa Gibbs, Philippa Pattison, Colin MacDougall, Louise Harms, Karen Block, et al. 2017. “Mental Health and Social Networks After Disaster.” American Journal of Psychiatry 174 (3): 277–85.
Burt, Ronald S. 2009. Structural Holes: The Social Structure of Competition. Harvard university press.
Butts, Carter T. 2015. Network: Classes for Relational Data. The Statnet Project. https://CRAN.R-project.org/package=network.
———. 2016. Sna: Tools for Social Network Analysis. https://CRAN.R-project.org/package=sna.
Coleman, James Samuel. 1964. Introduction to Mathematical Sociology.
Crossley, Nick, Elisa Bellotti, Gemma Edwards, Martin G Everett, Johan Koskinen, and Mark Tranmer. 2015. Social Network Analysis for Ego-Nets: Social Network Analysis for Actor-Centred Networks. Sage.
Davis, J. A. 1967. “Clustering and Structural Balance in Graphs.” Human Relations 20: 181–87.
Davis, Mark H et al. 1980. “A Multidimensional Approach to Individual Differences in Empathy.”
Dunbar, Robin IM. 1992. “Neocortex Size as a Constraint on Group Size in Primates.” Journal of Human Evolution 22 (6): 469–93.
Fischer, Claude S. 2009. “The 2004 GSS Finding of Shrunken Social Networks: An Artifact?” American Sociological Review 74 (4): 657–69.
Freeman, Linton C. 1978. “Centrality in Social Networks Conceptual Clarification.” Social Networks 1 (3): 215–39.
Hanneman, Robert A., and Mark Riddle. 2005. Introduction to Social Network Methods. Riverside, CA: University of California, Riverside. http://faculty.ucr.edu/~hanneman/.
Holland, P. W., and S. Leinhardt. 1972. “Local Structure in Social Networks.” In Sociological Methodology, edited by Heise D., 2:1–45. 5th Series. San Francisco, CA: Jossey-Bass.
Igarashi, Tasuku, and Johan Koskinen. 2020. “Overchoosing: A Mechanism of Tie-Formation in Social Networks.” https://psyarxiv.com/47q39.
Liljeros, Fredrik, Christofer R Edling, Luis A Nunes Amaral, H Eugene Stanley, and Yvonne Åberg. 2001. “The Web of Human Sexual Contacts.” Nature 411 (6840): 907–8.
Marsden, Peter V. 2003. “Interviewer Effects in Measuring Network Size Using a Single Name Generator.” Social Networks 25 (1): 1–16.
McPherson, Miller, Lynn Smith-Lovin, and Matthew E Brashears. 2006. “Social Isolation in America: Changes in Core Discussion Networks over Two Decades.” American Sociological Review 71 (3): 353–75.
McPherson, Miller, Lynn Smith-Lovin, and James M Cook. 2001. “Birds of a Feather: Homophily in Social Networks.” Annual Review of Sociology 27 (1): 415–44.
Morris, Martina. 2004. Network Epidemiology: A Handbook for Survey Design and Data Collection. Oxford University Press on Demand.
Padgett, John F, and Christopher K Ansell. 1993. “Robust Action and the Rise of the Medici, 1400-1434.” American Journal of Sociology 98 (6): 1259–319.
Putnam, Robert. 2001. “Social Capital: Measurement and Consequences.” Canadian Journal of Policy Research 2 (1): 41–51.
Robins, G. L., Pattison P. E., and J. H. Koskinen. 2008. “Network Degree Distributions.” MelNet Social Networks Laboratory Technical Report 08-02. Department of Psychology, School of Behavioural Science, University of Melbourne.
Robins, Garry. 2015. Doing Social Network Research: Network-Based Research Design for Social Scientists. Sage.
Rosenberg, Morris. 2015. Society and the Adolescent Self-Image. Princeton university press.
Rotter, Julian B. 1966. “Generalized Expectancies for Internal Versus External Control of Reinforcement.” Psychological Monographs: General and Applied 80 (1): 1.
Scheier, Michael F, Charles S Carver, and Michael W Bridges. 1994. “Distinguishing Optimism from Neuroticism (and Trait Anxiety, Self-Mastery, and Self-Esteem): A Reevaluation of the Life Orientation Test.” Journal of Personality and Social Psychology 67 (6): 1063.
Simpson, Edward H. 1949. “Measurement of Diversity.” Nature 163 (4148): 688.
Spiegelhalter, David. 2015. Sex by Numbers: What Statistics Can Tell Us about Sexual Behaviour. Profile books LTD, London.
Stys, Patrycja, Judith Verweijen, Papy Muzuri, Samuel Muhindo, Christoph Vogel, and Johan H Koskinen. 2020. “Brokering Between (Not so) Overt and (Not so) Covert Networks in Conflict Zones.” Global Crime 21 (1): 74–110.
Yamagishi, Toshio. 1986. “The Provision of a Sanctioning System as a Public Good.” Journal of Personality and Social Psychology 51 (1): 110.